MLCourse:Linear classification and Perceptron
From Dahuawiki
We first study the most simple problem in machine learning - linear classification.
Contents |
Binary classifier
For the samples in a d-dimensional space
, a binary classifier is a binary valued mapping given as
With the supervised learning setting, the classifier is trained based on a set of training samples, denoted by
, with corresponding binary
labels, denoted by
.
Linear classifiers through origin
We consider a simple class of classifier called linear classifier, which are thresholded linear mapping from the sample space to the binary values. Formally, they are given by the following form,
where
is a d-dimensional vector of real valued parameters, or say
.
Geometrically, in the sample space, the decision boundary of the above classifier is a line or (hyper)plane given by
at which the output lables of the samples change. It means that the samples residing on different sides of the boundary are considered in different classes.
Perceptron
Update rule
How to learn the linear classifier with training samples given? Here, learning the classifier is equivalent to estimate the parameters
.
A simplest algorithm to this end is the perceptron update rule. It considers all the training samples one by one, and updates the parameters if the current sample is misclassified using the following rule
The updating procedure cycles through all training samples until no mistake is encountered in a cycle (convergence).
The updating tends to correct the mistakes. Suppose
is misclassified, then we have
. After the parameters are updated to
, the value given by the linear function changes to
We can see that the value increases as a result of the update (becomes more positive).
Mistakes on different samples may steer the parameters in different directions so it may not be clear that the algorithm converges to a proper solution. We disuss this issue in the following.
Convergence
It has been shown that
If the training samples are linearly separable, the perceptron algorithm converges to give a prediction function that correctly classifies all samples in a finite number of updates.
We assume that
- all the training samples have bounded Euclidean norms (denoted by R)
-
- there exists a linear classifier (denoted by
the correctly classifies all the training samples with finite margin (denoted by γ)
-
Based on the two assumptions above, we can prove the convergence as follows
When we make the k-th update due the misclassification on, we get
Thus after k updates with have
In the other aspect, we have
Since
is misclassified, thus
, then we get
Combining the two aspects, we can have
Since normalized correlation is bounded by 1, we get
![]()
The above proof shows that the lower bound of the normalized correlation between the updated parameters and the ideal parameters increases. And the parameters converges in finite number of updates.
Geometric Analysis
The distance between a sample
and the decision plane defined by
is given by
which is the magnitude of the projection of
onto the plane.
Since γ is the lower bound of
, then the smallest distance between the sample and the ideal boundary is
. We call this value
the geometric margin. It serves as a measure of how well the two classes of samples are separated by a linear boundary.
Hence, we can rewrite the upper bound of update times as
It tells us that the difficulty of the problem is related to the ratio of the sample magnitude to the geometric margin.
The value given by (R / γgeom)2 essentially offers the measures for the following aspects:
- how fast the perceptron algorithm converges
- how difficult to separate the samples in the two classes
- how complex the classifier is. (It closely relates to the measurement of model complexity called VC-dimension)
- how well the classifier is expected to generalize on new samples
Thus after k updates with have
In the other aspect, we have
Since
, then we get
Combining the two aspects, we can have
Since normalized correlation is bounded by 1, we get
