MLCourse:Linear classification and Perceptron

From Dahuawiki

Jump to: navigation, search

We first study the most simple problem in machine learning - linear classification.

Contents

Binary classifier

For the samples in a d-dimensional space \mathcal{R}^d, a binary classifier is a binary valued mapping given as f: \mathcal{R}^d \rightarrow \{-1, 1\}

With the supervised learning setting, the classifier is trained based on a set of training samples, denoted by \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n, with corresponding binary \pm 1 labels, denoted by y_1, y_2, \ldots, y_n.

Linear classifiers through origin

We consider a simple class of classifier called linear classifier, which are thresholded linear mapping from the sample space to the binary values. Formally, they are given by the following form,

f(x; \boldsymbol{\theta}) = \mathrm{sign}(\boldsymbol{\theta}^n \mathbf{x}),

where \boldsymbol{\theta} is a d-dimensional vector of real valued parameters, or say \boldsymbol{\theta} \in \mathcal{R}^d.

Geometrically, in the sample space, the decision boundary of the above classifier is a line or (hyper)plane given by \boldsymbol{\theta}^T \mathbf{x} = 0, at which the output lables of the samples change. It means that the samples residing on different sides of the boundary are considered in different classes.

Perceptron

Update rule

How to learn the linear classifier with training samples given? Here, learning the classifier is equivalent to estimate the parameters \boldsymbol{\theta}.

A simplest algorithm to this end is the perceptron update rule. It considers all the training samples one by one, and updates the parameters if the current sample is misclassified using the following rule

\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + y_t \mathbf{x}_t, \quad \mathrm{if} \ y_t \ne f(\mathbf{x}_t; \boldsymbol{\theta}^{(t)})

The updating procedure cycles through all training samples until no mistake is encountered in a cycle (convergence).

The updating tends to correct the mistakes. Suppose \mathbf{x}_t is misclassified, then we have y_t \boldsymbol\theta^T \mathbf{x}_t < 0. After the parameters are updated to \boldsymbol\theta', the value given by the linear function changes to

y_t \boldsymbol\theta'^T \mathbf{x}_t = y_t (\boldsymbol\theta + y_t \mathbf{x}_t)^T \mathbf{x}_t = y_t \boldsymbol\theta^T \mathbf{x}_t + ||\mathbf{x}_t||^2

We can see that the value increases as a result of the update (becomes more positive).

Mistakes on different samples may steer the parameters in different directions so it may not be clear that the algorithm converges to a proper solution. We disuss this issue in the following.

Convergence

It has been shown that

If the training samples are linearly separable, the perceptron algorithm converges to give a prediction function that correctly classifies all samples in a finite number of updates.

We assume that

  1. all the training samples have bounded Euclidean norms (denoted by R)
    ||\mathbf{x}_t|| < R, \quad \forall t = 1, 2, \ldots, n
  2. there exists a linear classifier (denoted by \boldsymbol\theta^* the correctly classifies all the training samples with finite margin (denoted by γ)
    \exists \boldsymbol\theta^* \in \mathcal{R}^d, \ \mathrm{and} \ \gamma > 0: y_t{\boldsymbol\theta^*}^T \mathbf{x}_t \ge \gamma, \ \forall t = 1, 2, \ldots, n

Based on the two assumptions above, we can prove the convergence as follows

When we make the k-th update due the misclassification on \mathbf{x}_t, we get
    {\boldsymbol\theta^*}^T \boldsymbol\theta^{(k)} = {\boldsymbol\theta^*}^T \boldsymbol\theta^{(k-1)} + y_t {\boldsymbol\theta^*}^T \mathbf{x}_t \ge {\boldsymbol\theta^*}^T \boldsymbol\theta^{(k-1)} + \gamma
Thus after k updates with have
    {\boldsymbol\theta^*}^T \boldsymbol\theta^{(k)} \ge k \gamma
In the other aspect, we have
    ||\boldsymbol\theta^{(k)}||^2 = ||\boldsymbol\theta^{(k-1)} + y_t \mathbf{x}_t||^2        = ||\boldsymbol\theta^{(k-1)}||^2 + 2y_t(\boldsymbol\theta^{(k-1)})^T\mathbf{x}_t + ||\mathbf{x}_t||^2
Since \mathbf{x}_t is misclassified, thus y_t(\boldsymbol\theta^{(k-1)})^T\mathbf{x}_t \le 0, then we get
    ||\boldsymbol\theta^{(k)}||^2 \le ||\boldsymbol\theta^{(k-1)}||^2 + ||\mathbf{x}_t||^2 \le ||\boldsymbol\theta^{(k-1)}||^2 + R^2
Combining the two aspects, we can have
    \langle \boldsymbol\theta^{(k)}, \boldsymbol\theta^* \rangle       = \frac{(\boldsymbol\theta^*)^T \boldsymbol\theta^{(k)}}{||\boldsymbol\theta^*|| \cdot ||\boldsymbol\theta^{(k)}||}       \ge \frac{k \gamma}{\sqrt{k} R ||\boldsymbol\theta^*||}
Since normalized correlation is bounded by 1, we get 
    k \le \frac{R^2 ||\boldsymbol\theta^*||^2}{\gamma^2}

The above proof shows that the lower bound of the normalized correlation between the updated parameters and the ideal parameters increases. And the parameters converges in finite number of updates.

Geometric Analysis

The distance between a sample \mathbf{x}_t and the decision plane defined by \boldsymbol\theta^T \mathbf{x} = 0 is given by \frac{|\boldsymbol\theta^T \mathbf{x}|}{||\boldsymbol \theta||}, which is the magnitude of the projection of \mathbf{x}^T onto the plane.

Since γ is the lower bound of |\boldsymbol\theta^T\mathbf{x}_t|, then the smallest distance between the sample and the ideal boundary is \frac{\gamma}{||\boldsymbol\theta^*||}. We call this value \gamma_{geom} = \frac{\gamma}{||\boldsymbol\theta^*||} the geometric margin. It serves as a measure of how well the two classes of samples are separated by a linear boundary.

Hence, we can rewrite the upper bound of update times as

k \le \frac{R^2}{\gamma_{geom}^2} = \left(\frac{R}{\gamma_{geom}} \right)^2.

It tells us that the difficulty of the problem is related to the ratio of the sample magnitude to the geometric margin.

The value given by (R / γgeom)2 essentially offers the measures for the following aspects:

  1. how fast the perceptron algorithm converges
  2. how difficult to separate the samples in the two classes
  3. how complex the classifier is. (It closely relates to the measurement of model complexity called VC-dimension)
  4. how well the classifier is expected to generalize on new samples