Abstract
The learning of a pattern classification rule rests on acquiring information
to constitute a decision rule that is close to the optimal Bayes rule.
Among the various ways of conveying information, showing the learner
examples from the different classes is an obvious approach and ubiquitous
in the pattern recognition field. Basically, there are two types of
examples: labeled, in which the learner is provided with the correct
classification of the example, and unlabeled, in which this classification
is missing. Driven by the reality that unlabeled examples are often
plentiful whereas labeled examples are difficult or expensive to acquire,
we explore the tradeoff between labeled and unlabeled sample complexities
(the number of examples required to learn to within a specified error);
specifically, we obtain a quantitative measure of the reduction in the
labeled sample complexity that results from introducing unlabeled examples.
This problem was posed in this form by T. M. Cover and may be succinctly,
if inexactly, stated as follows: How many unlabeled examples is one
labeled example worth?
This dissertation focuses on the archetypal problem of learning a
classification rule for two pattern classes typified by feature vectors,
i.e., examples drawn from class-conditional Gaussian distributions, and
it does so under both parametric and nonparametric learning approaches.
Denoting the dimensionality of the
example-space as $N$, and the number of labeled and unlabeled examples
as $m$ and $n$ respectively, then for specific algorithms, it is
shown that under a nonparametric scenario the classification error
probability decreases roughly as $O\left(\left(c_{0}n^{-2/N}\right)^{\log N}\right)+O\left(e^{-c_{1}m}\right)$,
and in the parametric scenario the error decreases roughly as $O\left(N^{3/5}n^{-1/5}\right)+O\left(e^{-c_{1}m}\right)$,
where $c_{0}$, $c_{1}>0$ are constants with respect to $N$, $m$
and $n$. This shows that in both the parametric and nonparametric
cases it takes roughly exponentially more unlabeled examples than
labeled examples for the same reduction in error. When considering
the effect of the dimensionality $N$, roughly speaking, a labeled
example is worth exponentially more in the nonparametric than in the
parametric scenario.
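To make this comparison concrete, one heuristic reading of the rates above (a back-of-the-envelope calculation rather than a statement taken from the dissertation) equates each unlabeled-sample error term with the labeled-sample term $e^{-c_{1}m}$ and solves for $n$:

$$
N^{3/5}n^{-1/5}\approx e^{-c_{1}m}\;\Rightarrow\; n\approx N^{3}e^{5c_{1}m},
\qquad
\left(c_{0}n^{-2/N}\right)^{\log N}\approx e^{-c_{1}m}\;\Rightarrow\; n\approx c_{0}^{N/2}\,e^{\,c_{1}mN/(2\log N)}.
$$

In both cases $n$ grows exponentially in $m$, but the exponent is a fixed multiple of $c_{1}m$ in the parametric case and grows like $N/\log N$ in the nonparametric case, which is one way to see why a labeled example is worth exponentially more, as a function of $N$, under the nonparametric scenario.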
The parametric approach uses the Maximum Likelihood technique with
labeled and unlabeled samples to construct a decision rule estimate.
In this scenario the learner knows the parametric form of the pattern
class densities. Sufficient finite sample complexities are established
by which the value of one labeled example in terms of the number of
unlabeled examples is determined to be polynomial in the dimensionality
$N$. The analysis may provide the details for broadening the results
to other non-Gaussian parametric families of problems. An extension
to the case of different a priori class probabilities is investigated
under this parametric scenario, and for the non-unit covariance Gaussian
problem it is conjectured that the value of a labeled example is still
polynomial in $N$.
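As a minimal sketch of how such a joint maximum-likelihood estimate might be computed, consider an illustrative EM-style implementation for the two-class, unit-covariance, equal-prior Gaussian case with at least one labeled example per class; the dissertation's exact estimator may differ, and the function names here are hypothetical:

```python
import numpy as np

def fit_two_gaussian_means(X_lab, y_lab, X_unl, n_iter=100):
    """EM-style maximum-likelihood estimate of the two unit-covariance
    Gaussian class means from labeled and unlabeled examples (equal priors).
    Labeled examples enter with hard 0/1 responsibilities, unlabeled
    examples with posterior (soft) responsibilities."""
    mu = np.array([X_lab[y_lab == c].mean(axis=0) for c in (0, 1)])
    for _ in range(n_iter):
        # E-step: posterior class probabilities for the unlabeled examples
        # under unit-covariance Gaussians with equal priors.
        d = np.stack([((X_unl - mu[c]) ** 2).sum(axis=1) for c in (0, 1)], axis=1)
        r = np.exp(-0.5 * (d - d.min(axis=1, keepdims=True)))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted means over labeled (hard) and unlabeled (soft) data.
        for c in (0, 1):
            w = (y_lab == c).astype(float)
            num = (w[:, None] * X_lab).sum(axis=0) + (r[:, [c]] * X_unl).sum(axis=0)
            mu[c] = num / (w.sum() + r[:, c].sum())
    return mu

def plug_in_rule(x, mu):
    """Plug-in decision rule: assign x to the class with the nearer estimated mean."""
    return int(((x - mu[1]) ** 2).sum() < ((x - mu[0]) ** 2).sum())
```

For the unit-covariance, equal-prior case the Bayes rule is the nearest-mean rule, so the plug-in rule approaches it as the estimated means approach the true class means.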
In the nonparametric scenario the primary focus is on an algorithm
which is based on Kernel Density Estimation. It uses a mixed sample
to construct a decision rule where now the learner has significantly
less side information about the class densities. The finite sample
complexities for learning the Gaussian based problem are established
by which the value of one labeled example is determined to be exponential
in the dimensionality $N$. An extension to a larger family of nonparametric
classification problems is provided where the same tradeoff applies.
A variant of this approach is investigated in which only a finite
number of functional values of the underlying mixture density are
estimated. This yields a smaller tradeoff but is still exponential
in $N$. The mixed sample complexities for the classical $k$-means
clustering procedure are also determined.
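A minimal sketch of the mixed-sample idea in the clustering setting follows: a 2-means partition is fit to the pooled sample and each cluster is then labeled by majority vote over the labeled examples. This illustrates how unlabeled examples shape the partition while labeled examples only name its cells; it is not the dissertation's KDE-based rule, and the function name is hypothetical.

```python
import numpy as np

def kmeans_mixed_rule(X_lab, y_lab, X_unl, n_iter=50, seed=0):
    """Illustrative mixed-sample classifier: run 2-means on the pooled
    labeled-plus-unlabeled sample, then label each cluster by majority
    vote over the labeled examples that fall in it."""
    rng = np.random.default_rng(seed)
    X = np.vstack([X_lab, X_unl])
    centers = X[rng.choice(len(X), size=2, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign every (labeled or unlabeled) example to the nearer center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        # Recompute each center; keep the old one if its cluster is empty.
        for c in (0, 1):
            if np.any(assign == c):
                centers[c] = X[assign == c].mean(axis=0)
    # Label each cluster by majority vote of the labeled examples in it.
    lab_assign = ((X_lab[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    labels = [int(round(y_lab[lab_assign == c].mean())) if np.any(lab_assign == c) else c
              for c in (0, 1)]

    def rule(x):
        return labels[int(((x - centers) ** 2).sum(axis=1).argmin())]

    return rule
```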
An experimental investigation using neural networks examines the value
of a labeled example when learning a classification problem based
on a Gaussian mixture. For other classification problems, the cost
of learning, measured by the labeled sample size as a function of the
dimensionality $N$, is shown to be lower for a two-layer network
than for the regular single-layer Kohonen network. This is attributed
to the better discrimination ability of the classifier's partition.
Original language | English |
---|---|
Publisher | University of Pennsylvania |
Number of pages | 2 |
State | Published - Aug 1994 |
Externally published | Yes |