MLCourse:Kernels and Kernel regression

From Dahuawiki

Jump to: navigation, search

Contents

What is Kernel

Consider some feature mapping \phi(\mathbf{x}), the corresponding kernel function is defined as

K(x, x') = \phi^T(x) \phi(x').\,

Formally, a kernel function can be defined in two different ways.

  • The function K: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R} is a kernel function when there exists a feature mapping \phi\,, such that
K(x, x') = \phi^T(x) \phi(x'), \quad x, x' \in \mathcal{X}.
  • The function K: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R} is a kernel function when the gram matrix of any set of samples x_1, x_2, \ldots, x_n \in \mathcal{X}, defined as
\mathbf{K} =  \left[\begin{matrix} K(x_1, x_1) & K(x_1, x_2) & \cdots & K(x_1, x_n) \\ K(x_2, x_1) & K(x_2, x_2) & \cdots & K(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ K(x_n, x_1) & K(x_n, x_2) & \cdots & K(x_n, x_n) \end{matrix}\right],

is positive semi-definite.

In practice, a kernel function is selected in order to reflect the similarities between samples.

Standard kernels

Here lists some standard examples of kernels that are widely used.

  • Linear kernel
K(x, x') = x^T x'\,
  • Polynomial kernel
K(x, x') = (1 + x^T x')^p, \quad, p = 1, 2, \ldots
  • Radial basis kernel
K(x, x') = \exp\left(-\frac{\beta}{2}||x - x'||^2\right), \quad \beta > 0

Constructing kernels

We can construct various kernels from simpler ones based on the following rules

  1. K(x, x') = 1\, is a kernel;
  2. Given kernel K_1\, and any real-valued function f, K(x, x') = f(x)K_1(x, x')f(x')\, is a kernel;
  3. Given kernels K_1, K_2\,, K(x, x') = K_1(x, x') + K_2(x, x')\, is a kernel;
  4. Given kernels K_1, K_2\,, K(x, x') = K_1(x, x') K_2(x, x')\, is a kernel.

Kernel Regression

The kernel regression model is generalized from linear regression model, which is given by

y = \theta^T \phi(x) + \theta_0 + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2).

Considering the high dimensionality of the feature space, it is necessary to regularize the parameter estimates. The regularized least square objective with kernel is given by

J(\theta) = \frac{1}{2} \sum_{t=1}^n (y_t - \theta^T \phi(x_t))^2 + \frac{\lambda}{2}||\theta||^2.

This form can be derived from penalized log-likelihood estimation.

The regularization term tends to drive the parameters towards zeros. So any linear components are uncorrelated to the training samples will necessarily be set to zero. As a result, the optimal parameters lie in the span of the feature vectors corresponding to the training samples.

The optimality condition of \theta\, is derived by setting the gradient to zero as

\frac{\partial J(\theta)}{\theta} =  -\sum_{t=1}^n \left(y_t - \theta^T \phi(x_t)\right) \phi(x_t) + \lambda \theta = 0,

which results in

\hat{\theta} = \frac{1}{\lambda} \sum_{t=1}^n \alpha_t \phi(x_t).

Here, \alpha_t = y_t - \hat{\theta}^T \phi(x_t). This clearly manifests that \hat{\theta} lies in the span of the training features.

Combining the formulas above, we get

\alpha_t = y_t - \frac{1}{\lambda} \sum_{t'=1}^n \alpha_{t'} \phi(x_{t'})^T \phi(x_t).

Hence αt only depends on the kernel values. With gram matrix, we can rewrite the equation in matrix form as

\hat{\boldsymbol\alpha} = \mathbf{y} - \frac{1}{\lambda} \mathbf{K} \hat{\boldsymbol\alpha}.

The solution is

\hat{\boldsymbol\alpha} = \lambda \left(\lambda \mathbf{I} + \mathbf{K} \right)^{-1} \mathbf{y}.

After \hat{\boldsymbol\alpha}\, is obtained, the prediction can be given by

y = \theta^T \phi(x) = \frac{1}{\lambda} \sum_{t=1}^T \alpha_t \phi(x_t)^T \phi(x)  = \frac{1}{\lambda} \sum_{t=1}^n \alpha_t K(x_t, x).

Kernel Perceptron

Kernel-based classifier gives prediction on a sample x as

y = sign(θTφ(x)).

We use perceptron to train the classifier. The update rule in the context of kernel is generalized to be

\theta \leftarrow \theta + y_t \phi(x_t), \quad \mathrm{if} \ y_t(\theta^T \phi(x_t)) \le 0.

When the process converges, the solution is

\hat{\theta} = \sum_{t=1}^n \hat{\alpha}_t y_t \phi(x_t).

Clearly, \hat{\alpha}_t is the number of mistakes made on the t-th sample.

The algorithm is described as follows

  1. Initialize \alpha_1 = \alpha_2 = \cdots = \alpha_n = 0;
  2. Cycle through the updating steps as
    \alpha_t \leftarrow \alpha_t + 1, \quad  \mathrm{if} \  y_t \sum_{t'=1}^n \alpha_{t'} y_{t'} \phi^T(x_{t'}) \phi(x_t) = y_t \sum_{t'=1}^n \alpha_{t'} k(x_{t'}, x_t) \le 0

Then, the prediction can be computed as

\hat{y} = \mathrm{sign}\left(\sum_{t=1}^n \hat\alpha_t y_t K(x_t, x)\right)

Here, \alpha_t\, actually indicates the importance of a sample in predictions.

Kernel Optimization

We have discussed how to incorporating nonlinearity into the classification and regression algorithms with kernel. A question arises here is that how to choose an appropriate kernel.

An approach is to parameterize a kernel function and optimize the parameters in order to derive a kernel adapted better to the data. The key to this approach is which objective we would optimize. An ideal objective is the generalization risk, which is however infeasible to measure in practice. The surrogate measures include cross-validation or geometric margin.

There is a caveat that we should pay attention to when we use geometric margin as a measure of how good a kernel is. Without normalization, we can end up with arbitrarily large margin by simply scaling up the kernel, or the feature mapping. Therefore, the margin cannot serve as an appropriate criterion under such circumstances. The normalization of a kernel can be done as follows

\tilde{K}(x, x') = \frac{K(x, x')}{\sqrt{K(x, x)K(x', x')}},

which corresponds to the feature mapping as

\tilde{\phi}(x) = \phi(x) / ||\phi(x)||.

Another approach is kernel alignment. The basic idea is to adjust the kernel parameters so as to make it , or its Gram matrix, more towards an ideal target kernel. The similarity of a parameterized kernel K_\theta\, and a target kernel K^*\, can be computed in terms of normalized inner product of the gram matrices

\frac{\langle K^*, K_\theta\rangle}{\sqrt{\langle K^*, K^* \rangle \langle K_\theta, K_\theta \rangle}}.

Note that the kernel optimization mainly focus on fitting the training data well. In order to improve the generalization performance, we should carefully constrain the class of kernel functions that we can choose from. It is related to a core topic in machine learning - model selection.