## Abstract

This thesis addresses the problem of machine learning from biased datasets in the context of astronomical applications. In astronomy there are many cases in which the training sample does not follow the true distribution. The thesis examines different types of biases and proposes algorithms to handle them.

During learning and when applying the predictive model, active learning enables algorithms to select training examples from a pool of unlabeled data and to request the labels. This allows for selecting examples that maximize the algorithm's accuracy despite an initial bias in the training set. Against this background, the thesis begins with a survey of active learning algorithms for the support vector machine.
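
As an illustration of pool-based selection for a support vector machine, the sketch below implements one well-known query strategy: request the label of the unlabeled example closest to the current decision boundary. It is not necessarily a strategy the thesis favors. The linear decision function and the names `w`, `b`, `pool`, and `select_query` are assumptions made for this example.

```python
def decision_value(w, b, x):
    """Value of a linear SVM decision function f(x) = <w, x> + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def select_query(w, b, pool):
    """Return the index of the pool example closest to the decision boundary.

    Ranking by |f(x)| equals ranking by geometric distance to the
    hyperplane, because dividing by ||w|| does not change the order.
    """
    return min(range(len(pool)), key=lambda i: abs(decision_value(w, b, pool[i])))


# Hypothetical current model and unlabeled pool.
w, b = (1.0, -1.0), 0.0
pool = [(2.0, 0.0), (1.0, 0.9), (-3.0, 1.0)]
query_index = select_query(w, b, pool)
```

In a full loop, the selected example would be labeled, added to the training set, and the model retrained before the next query.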

If the cost of additional labeling is prohibitive, unlabeled data can often be used instead, and the sample selection bias can be overcome through domain adaptation, that is, by minimizing the discrepancy between the training sample and the true distribution. A simple method consists of weighting the elements of the training sample such that the empirical risk becomes an unbiased estimator of the risk under the true distribution. The respective weights can be computed as the ratio of the test probability density to the training probability density. A model selection criterion, which is known in the context of kernel-based weight estimators, is proposed to be combined with a nearest neighbor density ratio estimator. It is shown to compare favorably to alternative approaches when applied to large-scale problems with low-dimensional feature spaces, a common setting in astronomical applications such as photometric redshift estimation.
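
The weighting idea can be sketched in a few lines. The example below assumes one common form of nearest neighbor density ratio estimator: for each training point, take the distance to its k-th nearest training neighbor as a radius, count the test points inside that radius, and scale the count by the sample sizes. The function names, the brute-force neighbor search, and the choice to exclude the point itself from its own neighbors are assumptions of this sketch, not details taken from the thesis.

```python
import math


def nn_density_ratio(train, test, k):
    """Estimate w(x) ~ p_test(x) / p_train(x) at every training point.

    Brute-force O(n^2) search; requires k < len(train).
    """
    n_train, n_test = len(train), len(test)
    weights = []
    for x in train:
        # Sorted distances to all training points; index 0 is x itself (distance 0),
        # so index k is the distance to the k-th nearest other training point.
        dists = sorted(math.dist(x, t) for t in train)
        radius = dists[k]
        # Number of test points that fall inside the same ball.
        m = sum(1 for z in test if math.dist(x, z) <= radius)
        weights.append((n_train / n_test) * (m / k))
    return weights


def weighted_risk(losses, weights):
    """Importance-weighted empirical risk over the training sample."""
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)


# Toy 1-D data: the test sample is concentrated near x = 2.5.
train = [(0.0,), (1.0,), (2.0,), (3.0,)]
test = [(0.5,), (2.4,), (2.5,), (2.6,)]
weights = nn_density_ratio(train, test, k=1)
```

Training points in the region where test data are dense receive larger weights, so their losses count more in `weighted_risk`. The number of neighbors `k` is the free parameter that a model selection criterion would have to choose.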

Another form of bias stems from label noise. This thesis considers the scenario in which unreliable labels can be replaced by highly accurate labels at a certain cost. This is, for example, the case in crowd-sourcing, where unreliable labelers can be corrected by experts, or in astronomy, where a labeling based on photometric data can be improved by spectroscopic observations. An algorithm to actively select objects for correction under a limited re-labeling budget is presented. It is shown empirically to converge faster to the maximally attainable accuracy than the state of the art.
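
To make the budgeted re-labeling setting concrete, the sketch below uses a simple disagreement heuristic: rank objects by how implausible the current model finds their noisy label and send the top `budget` objects for correction. This is a generic baseline for illustration only, not the algorithm presented in the thesis. The binary labels, the names `noisy_labels` and `predicted_probs`, and the function itself are assumptions.

```python
def select_for_correction(noisy_labels, predicted_probs, budget):
    """Return indices of the `budget` objects whose noisy label looks least plausible.

    noisy_labels[i] is 0 or 1; predicted_probs[i] is the model's
    estimated probability that object i belongs to class 1.
    """
    def label_likelihood(i):
        p = predicted_probs[i]
        # Probability the model assigns to the label currently attached to object i.
        return p if noisy_labels[i] == 1 else 1.0 - p

    # Ascending order: the least plausible labels come first.
    ranked = sorted(range(len(noisy_labels)), key=label_likelihood)
    return ranked[:budget]


# Hypothetical noisy labels and model outputs.
noisy_labels = [1, 0, 1, 0]
predicted_probs = [0.9, 0.7, 0.2, 0.1]
to_correct = select_for_correction(noisy_labels, predicted_probs, budget=2)
```

After the selected labels are replaced by accurate ones, the model would be retrained and the selection repeated until the budget is exhausted.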

| Original language | English |
|---|---|
| Publisher | Department of Computer Science, Faculty of Science, University of Copenhagen |
| Number of pages | 91 |
| Publication status | Published - 2016 |