MLCourse:Logistic regression

From Dahuawiki

Logistic function

Generally, the prediction made by a classifier may have uncertainty about the label of a sample, especially when the sample resides near the boundary. To capture this notion, we assign a probability distribution over the two labels

P(y=1 | \mathbf{x}; \boldsymbol\theta, \theta_0) = g(\boldsymbol\theta^T \mathbf{x} + \theta_0),

where g(z) = (1 + \exp(-z))^{-1} is known as the logistic function.

For this function, we have the following properties

Monotonicity
g(z) increases monotonically with z, and
\lim_{z \rightarrow \infty} g(z) = 1, \quad \mathrm{and} \quad \lim_{z \rightarrow -\infty} g(z) = 0.
Symmetry
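g(z) + g(-z) = 1, \quad \mathrm{or\ equivalently} \quad g(-z) = 1 - g(z)
Consequently, for labels y \in \{-1, +1\}, both class probabilities can be written in a single expression:
P(y|\mathbf{x}; \boldsymbol\theta, \theta_0) = g(y \cdot (\boldsymbol\theta^T \mathbf{x} + \theta_0))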
Log-odds
\log \frac{g(z)}{g(-z)} = z
\log \frac{P(y=1|\mathbf{x}; \boldsymbol\theta, \theta_0)}{P(y=-1|\mathbf{x}; \boldsymbol\theta, \theta_0)} = \boldsymbol\theta^T\mathbf{x} + \theta_0

Thus the log-odds of the predicted class probabilities are exactly the linear prediction value \boldsymbol\theta^T\mathbf{x} + \theta_0, and they become zero at the decision boundary, where both labels have probability 1/2.
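The properties above can be checked numerically. The following is a minimal sketch, assuming the logistic function g(z) = 1 / (1 + exp(-z)); the helper names `g` and `log_odds` are illustrative and not part of the course material.

```python
import math

def g(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_odds(z):
    """log(g(z) / g(-z)), which should recover z itself."""
    return math.log(g(z) / g(-z))

for z in (-3.0, -0.5, 0.0, 0.5, 3.0):
    # Columns: z, g(z), g(z) + g(-z) (symmetry, should be 1), log-odds (should be z)
    print(z, g(z), g(z) + g(-z), log_odds(z))
```

At z = 0 the function returns 1/2, which corresponds to a point lying exactly on the decision boundary.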

Maximum Likelihood Estimation

By maximizing the joint likelihood of the samples, we obtain the maximum likelihood estimates of the parameters. The joint likelihood is given by

L(\boldsymbol\theta, \theta_0) = \prod_{t=1}^n P(y_t|\mathbf{x}_t; \boldsymbol\theta, \theta_0).

For convenience, we maximize the log-likelihood instead

l(\boldsymbol\theta, \theta_0) = \sum_{t=1}^n \log P(y_t|\mathbf{x}_t; \boldsymbol\theta, \theta_0)

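Equivalently, since P(y_t|\mathbf{x}_t; \boldsymbol\theta, \theta_0) = g(y_t(\boldsymbol\theta^T \mathbf{x}_t + \theta_0)) and -\log g(z) = \log(1 + \exp(-z)), we can minimize the negative log-likelihood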

-l(\boldsymbol\theta, \theta_0) = \sum_{t=1}^n \log \left( 1 + \exp(-y_t(\boldsymbol\theta^T \mathbf{x}_t + \theta_0)) \right)
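As an illustration, the negative log-likelihood can be evaluated directly from this formula. This is a sketch only: the function names `log1p_exp` and `neg_log_likelihood` and the three toy samples are assumptions made for the example, with labels taken in {-1, +1}.

```python
import math

def log1p_exp(a):
    """Numerically stable log(1 + exp(a))."""
    if a > 0:
        return a + math.log1p(math.exp(-a))
    return math.log1p(math.exp(a))

def neg_log_likelihood(theta, theta0, xs, ys):
    """Sum over samples of log(1 + exp(-y_t * (theta^T x_t + theta0)))."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = sum(w * xi for w, xi in zip(theta, x)) + theta0
        total += log1p_exp(-y * z)
    return total

# Hypothetical toy samples with two features each.
xs = [(1.0, 2.0), (-1.0, -1.5), (0.5, -0.5)]
ys = [1, -1, 1]

# With all parameters at zero every sample has probability 1/2,
# so the value is 3 * log(2).
print(neg_log_likelihood((0.0, 0.0), 0.0, xs, ys))
```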

The above objective function (the negative log-likelihood) is convex, so every local minimum is a global minimum. One way to find the minimum is gradient descent. To specify the updates, we need the derivatives of each summand

\frac{\partial}{\partial \theta_0} \log \left( 1 + \exp(-y_t(\boldsymbol\theta^T \mathbf{x}_t + \theta_0)) \right) = -y_t(1 - P(y_t | \mathbf{x}_t; \boldsymbol\theta, \theta_0))
\frac{\partial}{\partial \boldsymbol\theta} \log \left( 1 + \exp(-y_t(\boldsymbol\theta^T \mathbf{x}_t + \theta_0)) \right) = -y_t \mathbf{x}_t(1 - P(y_t | \mathbf{x}_t; \boldsymbol\theta, \theta_0))

Stepping against the gradient of a single summand, the parameters can be updated iteratively, one sample at a time, as

\theta_0 \rightarrow \theta_0 + \eta \cdot y_t(1 - P(y_t | \mathbf{x}_t; \boldsymbol\theta, \theta_0))
\boldsymbol\theta \rightarrow \boldsymbol\theta + \eta \cdot y_t \mathbf{x}_t(1 - P(y_t | \mathbf{x}_t; \boldsymbol\theta, \theta_0))

where η is called the learning rate.
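The update rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a tuned implementation: the names `sgd_step` and `train`, the fixed learning rate, the number of passes, and the toy one-dimensional data are all chosen for the example.

```python
import math

def g(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def sgd_step(theta, theta0, x, y, eta):
    """Apply one per-sample update; returns the new (theta, theta0)."""
    z = sum(w * xi for w, xi in zip(theta, x)) + theta0
    mistake = 1.0 - g(y * z)  # 1 - P(y | x) under the current parameters
    new_theta = [w + eta * y * mistake * xi for w, xi in zip(theta, x)]
    new_theta0 = theta0 + eta * y * mistake
    return new_theta, new_theta0

def train(xs, ys, eta=0.1, epochs=200):
    """Cycle through the samples repeatedly, starting from zero parameters."""
    theta, theta0 = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            theta, theta0 = sgd_step(theta, theta0, x, y, eta)
    return theta, theta0

# Hypothetical 1-D samples that overlap, so they are not linearly separable.
xs = [(-2.0,), (-1.0,), (0.5,), (-0.5,), (1.0,), (2.0,)]
ys = [-1, -1, -1, 1, 1, 1]
theta, theta0 = train(xs, ys)
print(theta, theta0)
```

Because each step is scaled by the mistake probability, samples that the current model already classifies confidently and correctly barely move the parameters.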

Note that 1 - P(y_t | \mathbf{x}_t; \boldsymbol\theta, \theta_0) is the probability, as estimated by the current model, of making a mistake on sample t. At the optimum, the gradient of the full objective vanishes:

\frac{\partial}{\partial \theta_0} (-l(\boldsymbol\theta, \theta_0)) = -\sum_{t=1}^n y_t (1 - P(y_t | \mathbf{x}_t; \boldsymbol\theta, \theta_0)) = 0
\frac{\partial}{\partial \boldsymbol\theta} (-l(\boldsymbol\theta, \theta_0)) = -\sum_{t=1}^n y_t \mathbf{x}_t (1 - P(y_t | \mathbf{x}_t; \boldsymbol\theta, \theta_0)) = 0

The optimality of \theta_0 enforces a balance of mistakes between the samples of the two classes, while the optimality of \boldsymbol\theta implies that the vector of mistake probabilities is orthogonal to all rows of the labeled-sample matrix (the matrix whose columns are y_t \mathbf{x}_t). Intuitively, this orthogonality ensures that the examples contain no further linearly available information that could improve the predicted probabilities.

Calibration

We show that the probability model given by logistic regression is calibrated.

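At the optimum, the derivative of the log-likelihood with respect to \theta_0 vanishes:

\frac{\partial l}{\partial \theta_0} = \sum_{t=1}^n y_t \left(1 - P(y_t | \mathbf{x}_t; \boldsymbol\theta, \theta_0) \right) = 0.

A sample with y_t = 1 contributes 1 - P(y_t = 1 | \mathbf{x}_t; \boldsymbol\theta, \theta_0) to this sum, while a sample with y_t = -1 contributes -(1 - P(y_t = -1 | \mathbf{x}_t; \boldsymbol\theta, \theta_0)) = -P(y_t = 1 | \mathbf{x}_t; \boldsymbol\theta, \theta_0). Collecting terms and dividing by n gives

\frac{1}{n} \sum_{t=1}^n \delta(y_t, 1) = \frac{1}{n} \sum_{t=1}^n P(y_t = 1|\mathbf{x}_t; \boldsymbol\theta, \theta_0),

where \delta(y_t, 1) equals 1 if y_t = 1 and 0 otherwise.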

This equation indicates that the empirical fraction of +1 labels equals the predicted fraction of +1 labels.
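This property can be observed numerically. The sketch below fits the model by full-batch gradient ascent on the log-likelihood and then compares the two fractions; the one-dimensional data, the step size, and the iteration count are assumptions chosen so that the fit converges on this small example.

```python
import math

def g(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical 1-D data that is not linearly separable, labels in {-1, +1}.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0]
ys = [-1, -1, 1, -1, 1, 1, 1]

theta, theta0, eta = 0.0, 0.0, 0.1
for _ in range(5000):
    # Per-sample terms y_t * (1 - P(y_t | x_t)) of the log-likelihood gradient.
    err = [y * (1.0 - g(y * (theta * x + theta0))) for x, y in zip(xs, ys)]
    theta += eta * sum(e * x for e, x in zip(err, xs))
    theta0 += eta * sum(err)

# Empirical fraction of +1 labels versus average predicted P(y = 1 | x).
empirical = sum(1 for y in ys if y == 1) / len(ys)
predicted = sum(g(theta * x + theta0) for x in xs) / len(xs)
print(empirical, predicted)
```

The two printed values agree only once the gradient with respect to \theta_0 is (numerically) zero; stopping the loop early would leave a visible gap.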

In this sense, we say that logistic regression is binary calibrated. This is a weak form of calibration, meaning that the probability model satisfies constraints on certain statistics of the data.

Regularized formulation

Imagine that the training samples are linearly separable. In this case, the joint likelihood can be increased indefinitely by scaling up the parameters, so the parameters become unbounded and diverge to infinity. To address this issue, we can regularize the formulation, as we did for the SVM, to constrain the solution to a reasonable range.

To estimate the parameters of the logistic regression model with regularization, we minimize the following objective

\frac{\lambda}{2}||\boldsymbol\theta||^2 + \sum_{t=1}^n \log\left(1 + \exp(-y_t(\boldsymbol\theta^T \mathbf{x}_t + \theta_0)) \right),

where the coefficient λ specifies the trade-off between correct classification and the regularization penalty.
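The effect of the penalty on separable data can be illustrated with a small sketch. It assumes hypothetical one-dimensional separable samples, a fixed direction scaled by a factor c, and λ = 0.1; the function names are illustrative only.

```python
import math

def log1p_exp(a):
    """Numerically stable log(1 + exp(a))."""
    if a > 0:
        return a + math.log1p(math.exp(-a))
    return math.log1p(math.exp(a))

def logistic_loss(theta, theta0, xs, ys):
    """Unregularized negative log-likelihood."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = sum(w * xi for w, xi in zip(theta, x)) + theta0
        total += log1p_exp(-y * z)
    return total

def regularized_objective(theta, theta0, xs, ys, lam):
    """(lam / 2) * ||theta||^2 plus the logistic loss; theta0 is not penalized."""
    penalty = 0.5 * lam * sum(w * w for w in theta)
    return penalty + logistic_loss(theta, theta0, xs, ys)

# Hypothetical linearly separable 1-D samples.
xs = [(-2.0,), (-1.0,), (1.0,), (2.0,)]
ys = [-1, -1, 1, 1]

for c in (1.0, 2.0, 4.0, 8.0):
    # Columns: scale c, unregularized loss, regularized objective with lam = 0.1
    print(c, logistic_loss([c], 0.0, xs, ys),
          regularized_objective([c], 0.0, xs, ys, lam=0.1))
```

In this example the unregularized loss keeps shrinking as c grows, whereas the regularized objective eventually rises again, so its minimizer stays finite.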