MLCourse:Maximum margin classifier and Support vector machine
From Dahuawiki
←Older revision | Newer revision→
Contents |
Maximum margin classifier
There exist multiple linear classifier that can perfectly classifies a set of linearly separable samples. It is desirable to derive the one with largest geometric margin, whose decision boundary is well separated from all training samples.
Given a set of training samples
, and their corresponding binary labels
, the problem can be straightforwardly written as
For convenience of optimization, we rewrite it into a problem to minimize a quadratic function of the parameters as
A further revision is given as follows
which shows that only the ratio
counts in the problem. We can solve the parameters
up to a constant scale. Without losing generality, we can set γ = 1. This results in a simplified formulation given by
It is a convex quadratic programming problem, and the solution to this problem is unique. This maximum margin linear classifier is the primal form of support vector machine.
Linear support vector machine
We modify the linear classifier by adding an offset term. The generalized form is given as
The incorporation of the enhance the flexibility of the linear classifier by eliminating the contraint of "passing through the origin". Generally, it can lead to a classifier with larger geometric margin.
As a consequence, the optimization problem is modified to
- Note
- The generalized formulation with a constant offset is different from the augmenting formulation by appending a each sample with a scalar 1.
Properties of support vector machines
The support vector machine has several nice properties.
- Drawing the separating boundary as far as possible from all training samples makes it robust to noisy samples.
- It tends to seek sparse solution, which depends on only a subset of samples that lie exactly on the margin. Those samples exactly on the margin
are called support vectors.
- The sparsity of the solution brings down the upper bound of cross-validation errors. As only the removal of the support vectors from the training set affects the solution. Then in linearly separable cases, we have
- Here, LOCVE is the leave-one-out cross-validation error of the support vector machine algorithm, #SV is the number of support vectors when all training samples are used.
However, the formulation is sensitive to mislabeled samples. Even only one sample is mislabeled. it may radically change the decision boundary or even lead to the failure of finding the classifier. Another problem is that it can not be applied to the case where the training samples are not linearly separable. These two drawbacks mainly originate from the hard constraints.
Relaxation: Allowing misclassified samples
The aforementioned problems can be effectively addressed by relaxing the margin contraint and thus allowing misclassified samples.
In mathematics, it introduces the "slack variables", which measures the degree to which the contraints are violated. While in the optimization objective function, we explicitly penalize the violation. This gives rise to the following relaxed formulation
Here, C is the coefficient of the cost for violating the margin constraint, which tunes the trade-off between misclassification and the potential benefit for other samples by gain in decreasing the norm of the parameters. Large C incurs heavy penality of contraint violation, while small C may achieve smaller parameter norm at the expense of allowing more contraints be violated.
In the relaxed formulation with C > 0, the support vectors, which are those vectors that have influence on the solution, include the samples exactly on the margin as well as the samples that violate the contraint.
Regularization
The SVM formulation can also be considered in the view of regularization. The formulation of regularized problems typcially involves a desired objective and a regularization penalty, which is used to
- help stabilize the minimization of the objective, or
- incorporate prior knowledge of the solutions.
In the linear classification problem, we can define the following loss function measuring how much the prediction deviates from the truth
We can write this more succinctly as
In addition, we prefer large margin. Such a preferrence can be reflected by incorporating a regularization term
.
As a consequence, we finally obtain
.
Here, the first term is a regularization penalty that encourages large margin, while the other term is a hinge loss that penalizes the prediction errors.
To sum up, generally, the loss function specifies what we want to achieve, while the regularization penalty selects the preferred solution.
