Decontamination of Mutually Contaminated Models



Bibliographic Details
Main Authors: Blanchard, Gilles, Scott, Clayton
Other Authors: Institut für Mathematik, Universität Potsdam; University of Michigan, Ann Arbor; Samuel Kaski; Jukka Corander
Format: Conference Object
Language: English
Published: HAL CCSD 2014
Online Access: https://hal.archives-ouvertes.fr/hal-03371264
https://hal.archives-ouvertes.fr/hal-03371264/document
https://hal.archives-ouvertes.fr/hal-03371264/file/blanchard14-supp-pdfjam.pdf
Summary: A variety of machine learning problems are characterized by data sets that are drawn from multiple different convex combinations of a fixed set of base distributions. We call this a mutual contamination model. In such problems, it is often of interest to recover these base distributions, or otherwise discern their properties. This work focuses on the problem of classification with multiclass label noise, in a general setting where the noise proportions are unknown and the true class distributions are nonseparable and potentially quite complex. We develop a procedure for decontamination of the contaminated models from data, which then facilitates the design of a consistent discrimination rule. Our approach relies on a novel method for estimating the error when projecting one distribution onto a convex combination of others, where the projection is with respect to a statistical distance known as the separation distance. Under sufficient conditions on the amount of noise and purity of the base distributions, this projection procedure successfully recovers the underlying class distributions. Connections to novelty detection, topic modeling, and other learning problems are also discussed.
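To make the mutual contamination model concrete, the sketch below generates samples from contaminated distributions built as convex combinations of base distributions. The Gaussian bases, the mixing matrix, and all names here are illustrative assumptions, not the paper's construction — the paper makes no parametric assumptions and treats the mixing proportions as unknown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical 1-D base distributions P_1, P_2 (Gaussians only for
# illustration; the paper's setting is nonparametric).
base_samplers = [
    lambda n: rng.normal(-2.0, 1.0, size=n),  # P_1
    lambda n: rng.normal(+2.0, 1.0, size=n),  # P_2
]

# Mixing matrix Pi (unknown to the learner in the paper's setting):
# row i defines the i-th contaminated distribution
#   P~_i = sum_j Pi[i, j] * P_j.
Pi = np.array([[0.8, 0.2],
               [0.3, 0.7]])
assert np.allclose(Pi.sum(axis=1), 1.0)  # each row is a convex combination

def sample_contaminated(i, n):
    """Draw n points from P~_i: pick a base index j with probability
    Pi[i, j], then sample from that base distribution."""
    picks = rng.choice(len(base_samplers), size=n, p=Pi[i])
    return np.concatenate([base_samplers[j](int(np.sum(picks == j)))
                           for j in range(len(base_samplers))])

X0 = sample_contaminated(0, 1000)  # a label-noisy "class 0" sample
X1 = sample_contaminated(1, 1000)  # a label-noisy "class 1" sample
```

Decontamination asks the inverse question: given only samples like `X0` and `X1`, recover the base distributions without knowing `Pi`.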