MLCourse:Linear regression and Ridge regression

From Dahuawiki

Revision as of 01:33, 4 October 2007; view current revision
←Older revision | Newer revision→
Jump to: navigation, search

Contents

Formulation

We now study the problem of predicting continuous output value given an input vector. One of the simplest form is linear regression. The relationship between input and output is given by

y = \boldsymbol\theta^T\mathbf{x} + \theta_0 + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2),

where \varepsilon is a noise term with zero mean. Then the mean prediction is given by a linear function as

E\{y | \mathbf{x}\} = \boldsymbol\theta^T \mathbf{x} + \theta_0.

Consider the simplest setting that $\theta_0$ satisfies the normal distribution, then the problem is formulated as

P(y|\mathbf{x}; \boldsymbol\theta, \theta_0, \sigma^2) = \mathcal{N}(\boldsymbol\theta^T \mathbf{x} + \theta_0, \sigma^2).

Here, the normal distribution is given by

\mathcal{N}(\mu, \sigma^2) = \frac{1}{2} \exp\left(- \frac{1}{2\sigma^2} (y - \mu)^2 \right)

Maximum Likelihood Estimation

Then the parameters can be solved by maximum likelihood estimation with the log-likelihood being

l(\boldsymbol\theta, \theta_0, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{t=1}^n (y_t - \boldsymbol\theta^T \mathbf{x}_t - \theta_0)^2.

This results in the optimization problem to minimize the following objective

\sum_{t=1}^n (y_t - \boldsymbol\theta^T \mathbf{x} - \theta_0)^2.

To solve this problem, we first introduce some notations as follows

\mathbf{y} = \left[ \begin{matrix} y_1 \\ \cdots \\ y_n \end{matrix} \right], \quad \mathbf{X} = \left[\begin{matrix} \mathbf{x}_1^T & y_1 \\ \cdots & \cdots \\ \mathbf{x}_n^T & y_n \end{matrix} \right].

Thereby, the objective function can be written in a matrix form as

\left|\left|\mathbf{y} - \mathbf{X}\left[\begin{matrix}\boldsymbol\theta \\ \theta_0 \end{matrix}\right]\right|\right|_F^2.

Then the solution can be derived as follows

\left[\begin{matrix} \hat{\boldsymbol\theta} \\ \hat{\theta}_0 \end{matrix} \right] = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}

In addition, the noice variance can be solved by MLE. The following is the solution

\hat{\sigma}^2 = \frac{1}{n} \sum_{t=1}^n (y_t - \boldsymbol\theta^T \mathbf{x}_t - \theta_0)^2,

which is the mean of the squared deviation, and measures how well the linear model explains the responses.

Bias and Variance of the Estimates

To analyze how good the estimates of parameters are, we make a strong assumption that there actually exists an underlying linear model that governs the relation between \mathbf{x} and y, which is characterized by some unknown parameters \boldsymbol\theta^*, \theta_0^*, \sigma^{*2}. Therefore, we can describe the observations as

y_t = \boldsymbol\theta^{*T} \mathbf{x}_t + \theta_0^* + \varepsilon_t.

This can be written in a matrix form as

\mathbf{y} = \mathbf{X} \left[ \begin{matrix} \boldsymbol\theta^* \\ \theta_0^* \end{matrix} \right] + \mathbf{e},

where \mathbf{e} = [\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n]^T that satisfies

E\{\mathbf{e}\} = \mathbf{0}, \quad E\{\mathbf{e}\mathbf{e}^T\} = \sigma^{*2}\mathbf{I}.

Bias and covariance of the parameter estimates

Plugging this formula to the MLE of parameters given above, we havb

\left[\begin{matrix} \hat{\boldsymbol\theta} \\ \hat{\theta_0} \end{matrix}\right] = \left[\begin{matrix} \boldsymbol\theta^* \\ \theta_0^* \end{matrix} \right] +  (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{e}.

We can see that the parameter estimates that we obtain can be decomposed into the sum of the true underlying parameters and the estimates based on the noise alone.

With fixed input, we then have the expectation of the estimates given by

E\left\{\left[\begin{matrix} \hat{\boldsymbol\theta} \\ \hat{\theta}_0 \end{matrix}\right] | \mathbf{X}\right\} = \left[\begin{matrix} \boldsymbol\theta^* \\ \theta^*_0 \end{matrix} \right],

which implies that the estimates of the parameters are unbiased.

We also have the covariance of the estimates as follows

\mathrm{cov}\left\{\left[\begin{matrix} \hat{\boldsymbol\theta} \\ \hat{\theta}_0 \end{matrix}\right] | \mathbf{X}\right\}  = \sigma^{*2} (\mathbf{X} \mathbf{X}^T)^{-1}.

Since, the estimates are unbiased, the mean squared error of them with respect to the true parameters is the variance of the parameters, which equals the trace of the covariance matrix, as

E\left\{\left|\left|\left[\begin{matrix}\hat{\boldsymbol\theta} \\ \hat{\theta}_0\end{matrix}\right] - \left[ \begin{matrix} \boldsymbol\theta^* \\ \theta_0^* \end{matrix}\right]\right|\right|^2 | \mathbf{X}\right\}  = \mathrm{var}\left\{\left[\begin{matrix} \hat{\boldsymbol\theta} \\ \hat{\theta}_0 \end{matrix}\right] | \mathbf{X}\right\} = \sigma^{*2} \mathrm{trace}(\mathbf{X}^T \mathbf{X})^{-1}.

Bias and variance of the predictions

With a trained model, the prediction is given by

\hat{z} = \hat{\boldsymbol\theta}^T \mathbf{x} + \hat{\theta}_0,

while the true response should be

z^* = \hat{\boldsymbol\theta}^{*T} \mathbf{x} + \hat{\theta}_0^*.

The expectation of the prediction is

E\{\hat{z}|\mathbf{x}; \mathbf{X}\} =  E\{\hat{\boldsymbol\theta}\}^T \mathbf{x} + E\{\hat\theta_0\} =  \hat{\boldsymbol\theta}^{*T} \mathbf{x} + \hat{\theta}_0^* = z^*.

Thus, the prediction is unbiased.

The covariance of the prediction is

\mathrm{var}(\hat{z}|\mathbf{x}; \mathbf{X}) =  \left[\begin{matrix} \mathbf{x} \\ 1 \end{matrix}\right]^T \mathrm{cov}\left\{\left[\begin{matrix}\boldsymbol\theta \\ \theta_0 \end{matrix}\right]\right\} \left[\begin{matrix} \mathbf{x} \\ 1 \end{matrix}\right] = \left[\begin{matrix} \mathbf{x} \\ 1 \end{matrix}\right]^T \sigma^{*2} (\mathbf{X}^T \mathbf{X})^{-1} \left[\begin{matrix} \mathbf{x} \\ 1 \end{matrix}\right]

Ridge Regression

It is often desirable to regularize the parameter estimates especially when the number of training samples is small compared to the number of parameters to estimate. Here, we derive the form of regularization by incorporating a prior distribution over the parameters.

Formulation

Specially, we consider the simple zero-mean Gaussian distribution as prior distribution, given by

P(\boldsymbol\theta, \theta_0 | \sigma'^2) =  \mathcal{N}(\left[\begin{matrix} \boldsymbol\theta \\ \theta_0 \end{matrix}\right];  \left[\begin{matrix} \mathbf{0} \\ 0 \end{matrix}\right], \sigma'^2 \mathbf{I}).

The variance σ'2 specifies how strongly we wish to bias the parameters towards zeros.

With the prior incorporated, the the log-likelihood is given by

l(\boldsymbol\theta, \theta_0) =  - \frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\sigma^2)  - \frac{1}{2\sigma^2} \sum_{t=1}^n (y_t - \boldsymbol\theta^T \mathbf{x}_t - \theta_0)^2 - \frac{d+1}{2} \log(2\pi) - \frac{d+2}{2} \log(\sigma'^2) - \frac{1}{2\sigma'^2} (\boldsymbol\theta^T \boldsymbol\theta + \theta_0^2).

It turns out to be an optimization problem to minimize the following objective function

J(\boldsymbol\theta, \theta_0) =  \frac{n}{2} \log(\sigma^2) + \frac{d+1}{2} \log(\sigma'^2) + \frac{1}{2\sigma^2} \sum_{t=1}^n (y_t - \boldsymbol\theta^T \mathbf{x}_t - \theta_0)^2 + \frac{1}{2\sigma'^2} (\boldsymbol\theta^T \boldsymbol\theta + \theta_0^2).

We typically tie the noise variance σ2 and prior variance σ'2 together in order to conveniently control the trade-off between fitting and regularization. Let σ'2 = σ2 / λ, we have

J(\boldsymbol\theta, \theta_0) =  \frac{n + d + 1}{2} \log(\sigma^2) - \frac{d+1}{2} \log\lambda + \frac{1}{2\sigma^2} \sum_{t=1}^n    \left[ (y_t - \boldsymbol\theta^T \mathbf{x}_t - \theta_0)^2 + \lambda (\boldsymbol\theta^T \boldsymbol\theta + \theta_0^2) \right].

The regularized problem given above is called Rigde regression.

Maximum Likelihood Estimation

Based on the objective function above, we can solve the parameters \boldsymbol\theta and θ0 by minimizing

(y_t - \boldsymbol\theta^T \mathbf{x}_t - \theta_0)^2 + \lambda (\boldsymbol\theta^T \boldsymbol\theta + \theta_0^2),

which can be rewritten in matrix form as

\left|\left|\mathbf{y} - \mathbf{X} \left[\begin{matrix} \boldsymbol\theta \\ \theta_0 \end{matrix}\right] \right|\right|^2 + \lambda \left[\begin{matrix} \boldsymbol\theta \\ \theta_0 \end{matrix}\right]^T  \left[\begin{matrix} \boldsymbol\theta \\ \theta_0 \end{matrix}\right]

The optimal solution is derived as

\left[\begin{matrix} \hat{\boldsymbol\theta} \\ \hat\theta_0 \end{matrix}\right] = (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y}

Bias and variance analysis

Now we study the bias and variance of the estimates obtained by ridge regression.

The expectation of the estimated parameters is

E\left\{ \left[\begin{matrix} \hat{\boldsymbol\theta} \\ \hat\theta_0 \end{matrix}\right] \right\} =  (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T  \left[\begin{matrix} \boldsymbol\theta^* \\ \theta_0^* \end{matrix}\right] = \left(\mathbf{I} - \lambda(\lambda\mathbf{I} + \mathbf{X}^T\mathbf{X})^{-1}\right) \left[\begin{matrix} \boldsymbol\theta^* \\ \theta_0^* \end{matrix}\right] =  \left[\begin{matrix} \boldsymbol\theta^* \\ \theta_0^* \end{matrix}\right] -  \lambda(\lambda\mathbf{I} + \mathbf{X}^T\mathbf{X})^{-1}  \left[\begin{matrix} \boldsymbol\theta^* \\ \theta_0^* \end{matrix}\right]

We can see that the parameter estimates are biased, shrinking towards zeros. The regularization coefficient λ controls the degree of shrinking.

The covariance of the estimated parameters is

\mathrm{cov}\left\{ \left[\begin{matrix} \hat{\boldsymbol\theta} \\ \hat\theta_0 \end{matrix}\right] \right\} = = \sigma^{*2}(\lambda \mathbf{I} + \mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{X} (\lambda \mathbf{I} + \mathbf{X}^T \mathbf{X})^{-1} = \sigma^{*2}(\lambda \mathbf{I} + \mathbf{X}^T \mathbf{X})^{-1} - \lambda \sigma^{*2}(\lambda \mathbf{I} + \mathbf{X}^T \mathbf{X})^{-2}

Hence, the mean squared error of the estimates is

E\left\{\left|\left|   \left[\begin{matrix} \hat{\boldsymbol\theta} \\ \hat\theta_0 \end{matrix}\right] -  \left[\begin{matrix} \boldsymbol\theta^* \\ \theta_0^* \end{matrix}\right] \right|\right|^2 | \mathbf{X} \right\} = \sigma^{*2} \mathrm{trace}\left( (\lambda \mathbf{I} + \mathbf{X}^T \mathbf{X})^{-1} - \lambda (\lambda \mathbf{I} + \mathbf{X}^T \mathbf{X})^{-2} \right) +  \lambda^2 \left[\begin{matrix} \boldsymbol\theta^* \\ \theta_0^* \end{matrix}\right]^T (\lambda \mathbf{I} + \mathbf{X}^T \mathbf{X})^{-2}  \left[\begin{matrix} \boldsymbol\theta^* \\ \theta_0^* \end{matrix}\right]

We can reduce the mean squared error by setting a proper value of λ, at the expense of having a biased estimation.


Active Learning for Linear Regression

As we have seen, the bias and variance of the estimates are affected by the training samples. Thus, we can actively select input samples to order to improve the estimates. This is called Active Learning.

The goal is to select as few responses as possible without compromising the estimation accuracy. By incorporating some guidance in selecting training samples, we can generally need far fewer examples in comparison to randomly drawing them from the underlying distribution.

Minimizing mean square error of the parameter estimates

Here, we discuss the active learning problem in the context of linear regression (without regularization). Recall that the mean square error of the parameter estimates is

E\left\{ \left|\left| \left[ \begin{matrix} \hat{\boldsymbol\theta} \\ \hat\theta_0 \end{matrix} \right] -  \left[ \begin{matrix} \boldsymbol\theta^* \\ \theta_0^* \end{matrix} \right] \right| \right|^2 | \mathbf{X} \right\}  = \sigma^{*2} \mathrm{trace} \left( (\mathbf{X}^T \mathbf{X})^{-1} \right).

We can select training samples in order to minimize \mathrm{trace} \left( (\mathbf{X}^T \mathbf{X})^{-1} \right). One caveat of the approach is that it relies on the underlying relationship between the inputs and the responses to be linear. When this is no longer the case, we may end up with suboptimal selections.

We are considering such a problem. Suppose we have already have sample set \mathbf{X}, and try to select another input example \mathbf{x}. Let \mathbf{A} = (\mathbf{X}^T \mathbf{X})^{-1} and \mathbf{v} = [\mathbf{x}, 1]^T, then the problem is to find a valid \mathbf{v} that minimizes

\mathrm{trace}\left( (\mathbf{A}^{-1} + \mathbf{vv}^T)^{-1} \right).

The matrix inverse is given by

(\mathbf{A}^{-1} + \mathbf{vv}^T)^{-1} = \mathbf{A} - \frac{1}{(1 + \mathbf{v}^T \mathbf{A} \mathbf{v})}\mathbf{Avv}^T \mathbf{A},

so that the trace becomes

\mathrm{trace}\left( (\mathbf{A}^{-1} + \mathbf{vv}^T)^{-1} \right) -  \frac{\mathbf{v}^T \mathbf{AAv}}{(1 + \mathbf{v}^T \mathbf{Av})}.

Since \mathbf{A} is fixed, we are to maximize

\frac{\mathbf{v}^T \mathbf{AAv}}{(1 + \mathbf{v}^T \mathbf{Av})}.

If there is no any constraint on \mathbf{v}, then we will end up with an infinite solution. If we constrain ||\mathbf{v}|| \le c, then the optimal \mathbf{v} is the eigenvector corresponding to the largest eigenvalue of magnitude c. However, given the constraint that the last element of \mathbf{v} is 1, there is no generic analytical solution.

Finding the most uncertain sample

We can also try to find the point \mathbf{x} whose response that we are most uncertain about. Recall that

\mathrm{var}\left\{y |\mathbf{x}; \mathbf{X}\right\} =  \left[ \begin{matrix} \mathbf{x} \\ 1 \end{matrix} \right]^T  \sigma^{*2} (\mathbf{X}^T \mathbf{X})^{-1}  \left[ \begin{matrix} \mathbf{x} \\ 1 \end{matrix} \right] = \sigma^{*2} \mathbf{v}^T \mathbf{Av}.

Then we are to maximize this function with the constraint that the last element of \mathbf{v} is 1.