MLCourse:Kernels and Kernel regression
From Dahuawiki
Contents |
What is Kernel
Consider some feature mapping
, the corresponding kernel function is defined as
Formally, a kernel function can be defined in two different ways.
- The function
is a kernel function when there exists a feature mapping
, such that
- The function
is a kernel function when the gram matrix of any set of samples
, defined as
is positive semi-definite.
In practice, a kernel function is selected in order to reflect the similarities between samples.
Standard kernels
Here lists some standard examples of kernels that are widely used.
- Linear kernel
- Polynomial kernel
- Radial basis kernel
Constructing kernels
We can construct various kernels from simpler ones based on the following rules
-
is a kernel;
- Given kernel
and any real-valued function f,
is a kernel;
- Given kernels
,
is a kernel;
- Given kernels
,
is a kernel.
Kernel Regression
The kernel regression model is generalized from linear regression model, which is given by
.
Considering the high dimensionality of the feature space, it is necessary to regularize the parameter estimates. The regularized least square objective with kernel is given by
This form can be derived from penalized log-likelihood estimation.
The regularization term tends to drive the parameters towards zeros. So any linear components are uncorrelated to the training samples will necessarily be set to zero. As a result, the optimal parameters lie in the span of the feature vectors corresponding to the training samples.
The optimality condition of
is derived by setting the gradient to zero as
which results in
Here,
. This clearly manifests that
lies in the span of the training features.
Combining the formulas above, we get
Hence αt only depends on the kernel values. With gram matrix, we can rewrite the equation in matrix form as
The solution is
After
is obtained, the prediction can be given by
Kernel Perceptron
Kernel-based classifier gives prediction on a sample x as
- y = sign(θTφ(x)).
We use perceptron to train the classifier. The update rule in the context of kernel is generalized to be
When the process converges, the solution is
Clearly,
is the number of mistakes made on the t-th sample.
The algorithm is described as follows
- Initialize
;
- Cycle through the updating steps as
Then, the prediction can be computed as
Here,
actually indicates the importance of a sample in predictions.
Kernel Optimization
We have discussed how to incorporating nonlinearity into the classification and regression algorithms with kernel. A question arises here is that how to choose an appropriate kernel.
An approach is to parameterize a kernel function and optimize the parameters in order to derive a kernel adapted better to the data. The key to this approach is which objective we would optimize. An ideal objective is the generalization risk, which is however infeasible to measure in practice. The surrogate measures include cross-validation or geometric margin.
There is a caveat that we should pay attention to when we use geometric margin as a measure of how good a kernel is. Without normalization, we can end up with arbitrarily large margin by simply scaling up the kernel, or the feature mapping. Therefore, the margin cannot serve as an appropriate criterion under such circumstances. The normalization of a kernel can be done as follows
which corresponds to the feature mapping as
Another approach is kernel alignment. The basic idea is to adjust the kernel parameters so as to make it , or its Gram matrix, more towards an ideal target kernel. The similarity of a parameterized kernel
and a target kernel
can be computed in terms of normalized inner product of the gram matrices
Note that the kernel optimization mainly focus on fitting the training data well. In order to improve the generalization performance, we should carefully constrain the class of kernel functions that we can choose from. It is related to a core topic in machine learning - model selection.
