MLCourse:Kernel SVM

From Dahuawiki

Jump to: navigation, search

By incorporating kernel techniques with support vector machine, we can derive a nonlinear classifier with large margin in feature space.

Kernel SVM (Strict formulation)

Suppose we have a set of training samples (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n). Given a feature mapping φ(x), the strict formulation (without slack variables) of support vector machine in the feature space is

\mbox{minimize } \frac{1}{2}||\theta||^2, \quad \mbox{s.t. }  y_t (\theta^T  \phi(x_t) + \theta_0) \ge 1, \ t = 1, 2, \ldots, n.

This is called primal form of the SVM problem.

In order to applying kernel tricks, we can turn it into its dual form such that the features only appear in inner products. The Lagrangian of the above problem is

L(\theta, \theta_0; \alpha) =  \frac{1}{2} ||\theta||^2  - \sum_{t=1}^n \alpha_t \left(y_t (\theta^T \phi(x_t) + \theta_0) - 1 \right), \quad \alpha_t \ge 0, \ \ t = 1, 2, \ldots, n.

At the optima, the derivatives of the Lagrangian with respect to the parameters zeros. Thus,

\frac{\partial L}{\partial \theta}|_{\hat{theta}} = 0  \Rightarrow  \hat{\theta} = \sum_{t=1}^n \alpha_t y_t \phi(x_t);
\frac{\partial L}{\partial \theta_0}|_{\hat{\theta}_0} = 0 \Rightarrow  \sum_{t=1}^n \alpha_t y_t = 0.

Plugging them to the Lagrangian, we can derive the dual form according to the Slater Conditions as follows

\mbox{maximize } \sum_{t=i}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n\sum_{j=1}^n\alpha_i \alpha_j y_i y_j K(x_i, x_j),
\mbox{s.t. } \sum_{i=1}^n \alpha_i y_i = 0,  \mbox{ and } \alpha_i = 0, \ i = 1, 2, \ldots, n.

It is also a convex quadratic optimization problem. However, the variables to be solved in the dual form are α.

With the optimal values of α obtained, the parameters can be derived as follows.

\hat\theta = \sum_{t=1}^n \hat\alpha_t y_t \phi(x_t).

Hence, the discriminant function is given by

\hat{y}(x) = \hat\theta^T \phi(x) + \hat\theta_0 = \sum_{t=1}^n \hat\alpha_t y_t \phi(x_t)^T \phi(x) + \hat\theta_0 = \sum_{t \in SV} \hat\alpha_t y_t k(x_t, x) + \hat \theta_0.

While the offset θ0 can be computed from the following constraint on support vector,

y_i (\hat\theta^T \phi(x) + \hat\theta_0)  = y_i \left( \sum_{t \in SV} \hat\alpha_t y_t k(x_t, x) + \hat \theta_0 \right) = 1

Here, the i-th sample is support vector, i.e. \alpha_i > 0\,. In principle, each support vector results in the same offset value. However, there will be some deviation due to the computation resolution. Thus it is advisable to take th median of the solutions given by all support vectors.

The geometric margin (in the feature space) of the classifier is

\hat{\gamma}_{gemo} = \frac{1}{||\hat\theta||} = \left(\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j K(x_i, x_j)\right)^{-1/2}.

Kernel SVM (Relaxed formulation)

We can introduce slack variables to allow some training samples violate the constraint with finite penalty. The primal form of the problem is

\mbox{minimize } \frac{1}{2} ||\theta||^2 + C \sum_{t=1}^n \xi_t,
\mbox{s.t. } y_t(\theta^T \phi(x_t) + \theta_0) \ge 1 - \xi_t, \ t = 1, 2, \ldots, n,  \mbox{ and } \xi_t \ge 0, \ t = 1, 2, \ldots, n.

The Lagrangian is given by

L(\theta, \theta_0, \alpha, \mu) = \frac{1}{2} ||\theta||^2 + C \sum_{t=1}^n \xi_t  - \sum_{t=1}^n \alpha_t (y_t(\theta^T \phi(x_t) + \theta_0) + \xi_t - 1) - \sum_{t=1}^n \mu_t \xi_t,  \quad \alpha_t \ge 0, \ \mu_t \ge 0, \ t = 1, 2, \ldots, n.

At optima, the derivatives of the Lagrangian are zeros.

\frac{\partial L}{\partial \theta}|_{\hat\theta} = 0 \Rightarrow \hat\theta = \sum_{t=1}^n \alpha_t y_t x_t;
\frac{\partial L}{\partial \theta_0}|_{\hat\theta_0} = 0 \Rightarrow \sum_{t=1}^n \alpha_t y_t = 0;
\frac{\partial L}{\partial \xi_t}|_{\hat\xi_t} = 0 \Rightarrow C - \alpha_t - \mu_t = 0.

Then the dual form can be derived by combining the Lagrangian with the equations above

\mbox{maximize } \sum_{i=1}^n \alpha_i - \frac{1}{2} \alpha_i \alpha_j y_i y_j K(x_i, x_j),

subject to

\sum_{i=1}^n \alpha_i y_i = 0,
0 \le \alpha_i \le C, \ i = 1, 2, \ldots, n.

Solving the dual problem, we can obtain the optimal values of α. Then, the parameter can be given by

\hat\theta = \sum_{t=1}^n \hat\alpha_t y_t x_t = \sum_{t \in SV} \hat\alpha_t y_t x_t.

As a result, the discriminant function is

\hat{y}(x) = \sum_{t \in SV} \hat\alpha_t y_t k(x_t, x) + \hat\theta_0.

In order to see what how to derive the offset parameter θ0. We further analyze the constraints. According to the KKT condition, at optima, we have

\hat\alpha_i \left(y_i (\sum_{t \in SV} \hat\alpha_t y_t k(x_t, x_i) + \hat\theta_0) - (1 - \xi_t)\right) = 0
\hat\mu_i \xi_i = 0.

In addition, we have \alpha_t + \mu_t = C\, There are three possible cases.

  1. \alpha_i = 0, \mu_i = C\,
    It means that the margin constraint is satisfied and inactive. The i-th sample is not support vector. The value of \xi_i\, is determined by the nonnegative constraint, which is active.
  2. 0 < \alpha_i < C, 0 < \mu_i < C\,
    It means that both the margin constraint and the nonnegative constraint are active. The i-th sample lies at margin, and is support vector. Since the nonnegative constraint is active, \xi_i = 0\,.
  3. \alpha_i = C, \mu_i = 0\,
    It means that the margin constraint is active, while nonnegative constraint is not tight. It is possible that \xi_i > 0\,, and thus the i-th sample probably violates the margin constraint. The samples in this case are also support vectors.

Consequently, the offset θ0 can be obtained from the support vectors with 0 < \alpha_i < C\, based on the following equation

y_i (\sum_{t\in SV} \hat\alpha_t y_t K(x_t, x_i) + \hat\theta_0) = 1.