Monte Carlo Sampling

From Dahuawiki

Revision as of 02:24, 18 September 2007; view current revision
←Older revision | Newer revision→
Jump to: navigation, search

Monte Carlo simulation gains increasing popularity in the past decade and has played significant roles in computational science, statistics, and even physics. Its basic principle is to approximate a complex distribution using a finite set of samples. Generally, the family of Monte Carlo methods comprises two related aspects: sampling from a distribution and using the samples to accomplish some tasks.

In the evoluation of the Monte Carlo methodology, the development of Metropolis-Hastings algorithm is acknowledged as the most influential progress, which lies at the heart of a famous class of methods called Markov Chain Monte Carlo (MCMC) that are widely employed in practice.

Contents

Basic Principles of Monte Carlo

The basic idea of Monte Carlo simulation is to approximate the target distribution p(x) by randomly drawing an i.i.d set of samples from it. The approximation based on a set of N samples is given by

p_N(x) = \frac{1}{N} \sum_{i=1}^N \delta(x - x_i)

Here, the samples x_1, x_2, \ldots, x_N are independently drawn from p(x).

Then we can approximate the integrals over propability space with the sum of finite terms

I(f) = \int_{\mathcal{X}} f(x) p(x) dx \ \simeq \ I_N(f) = \frac{1}{N} \sum_{i=1}^N f(x_i).

It has been proved that IN(f) will almost surely converges to I(f) as the number of samples n increases.

In addition to computing integrals, Monte Carlo can also be applied in perform optimization as

\hat{x} = \operatorname{argmax}_{i = 1,\ldots,N} \ f(x^{(i)}).

This is closely related to a widely used global optimization strategy: simulated annealing.

Rejection Sampling

It is sometimes much more easier to sample from a distribution q(x) than to directly sample from the target distribution p(x). Rejection sampling is a simple method to perform Monte Carlo by sampling from the proposal distribution q(x) and rejecting the samples in a certain probability to simulate the sampling directly done on p(x).

Given a target distribution p(x), we first find a proposal distribution q(x), and a positive scalar M, such that p(x) \le M q(x). Then the sampling procedure is to draw samples from q(x) and accept each sample with the probability \frac{p(x)}{Mq(x)}.

For rejection sampling, the overall acceptance probability is

E_{q(x)}\left\{\frac{p(x)}{M q(x)}\right\} = \frac{1}{M}

It means that if M is large, the sampling efficiency will be low. The difficulty is how to choose a suitable M such that p(x) \le M q(x) is satisfied and it is not too large.

Importance Sampling

Importance sampling is a classic method in Monte Carlo sampling, which is the root of many widely-used sampling methods like Metropolis-Hastings algorithm and Particle filters.

Like rejection sampling, it samples from a proposal distribution q(x), which is also known as importance distribution, and assigns each sample a weight. The rationale is simple as in the formula

I(f) = \int f(x) p(x) dx = \int f(x) w(x) q(x) dx

here, w(x) = p(x) / q(x) is the weighting function. The Monte Carlo estimate can be thereon derived as

I_N(f) = \sum_{i=1}^N w(x_i) f(x_i), \quad w(x_i) = \frac{p(x_i)}{q(x_i)}.

Optimal importance distribution

Generally, it is desirable to choose an importance distribution q(x) with minimum variance. In that way we can achieve high sampling efficiency. The variance of f(x)w(x) with respect to q(x) is given by

\operatorname{var}_{q(x)}\{f(x) w(x)\} = E_{q(x)}\{f^2(x) w^2(x)\} - I^2(f)

Only the first term depends on q(x), which is bounded by

E_{q(x)}\{f^2(x) w^2(x)\} \ge \left(E_{q(x)} \{|f(x)|w(x)\}\right)^2 = \left(\int |f(x)| p(x) dx\right)^2.

The lower bound is attained when f(x)|w(x) is a constant. Hence the optimal importance distribution that minimizes the variance is

q^*(x) = \frac{|f(x)| p(x)}{\int |f(x)| p(x) dx}

This indicates that the optima is achieved when all weighted terms are balanced.

Notes

  1. Using the optimal importance distribution is more efficient than using the original distribution p(x). The intuition is that by taking f(x) into account, the optimal importance sampling tends to draw the samples with large |f(x)| rather than wasting time sampling in the region of no utility.
  2. It is often difficult to directly sample in the optimal importance distribution. To address this issue, more sophisticated schemes are developed, including adaptive importance sampling and MCMC.

The Metropolis-Hastings Algorithm

The Metropolis-Hastings algorithm (M-H algorithm) is first proposed by Metropolis in 1953, and then generalized to be the current form by Hastings in 1970. Different from traditional sampling process in which the samples are drawn independently, M-H algorithm draws each sample with the probability conditioned on the previous sample. Thereby, the process essentially form a Markov chain. The basic idea of M-H algorithm is to use the equilibrium distribution of the Markov chain to simulate the target distribution. In this sense, it is called a Markov Chain Monte Carlo (MCMC) method.

Given a target distribution p(x) and proposal conditional distribution q(x' | x), the algorithm can be briefly described as follows

1. Initialize x0
2. Repeat
   (a) Sample x^* \sim q(x^* | x_i)
   (b) Set A(x_i, x^*) = \min\{1, \frac{p(x^*)q(x_i|x^*)}{p(x_i)q(x^*|x_i)}\}
   (c) Accept the new sample with probability A(x *  | xi).
   (d) Set i = i + 1 if the new sample is accepted.
3. Stop until sufficient samples are drawn.

It has been proved that the equilibrium distributin of the Markov process derived by the above algorithm is p(x), if the Markov process is irreducible and aperiodical.

Specially, its the proposal distribution is symmetry as q(xi | x * ) = q(x * | xi), the Metropolis-Hastings algorithm degenerates to the original Metropolis algorithm, in which the acceptance ratio is given by

A(x_i, x^*) = \min\left\{1, \frac{p(x^*)}{p(x_i)} \right\},

which only depends on the ratio between the probability of the next sample and the current sample.

Gibbs Sampler

Gibbs sampler is a widely used sampling method in practice. It can be considered as a special case of M-H algorithm in which only one value is changed at each sampling and that value is drawn from the conditional probability as

q(x^* | x_i) =  \begin{cases} p(x^*(j) | x_i(-j)) & (x^*(-j) = x_i(-j)); \\ 0 & (otherwise).  \end{cases}

Here x(j) refers to the j-th value of x, while x(-j) refers to all other values of x except j. In the cases that the value of x(j) only depends on a subset of other values, then we can further rewrite the proposal function as

q(x^* | x_i) =  \begin{cases} p(x^*(j) | x_i(\mathcal{S}(j))) & (x^*(-j) = x_i(-j)); \\ 0 & (otherwise).  \end{cases}

Here, \mathcal{S}(j) denotes the subset of variables (except j) which the value of j depends on.

With this proposal function, we can derive the acceptance ratio as

A(x_i, x^*)  = \min\left\{1, \frac{p(x^*)p(x_i(j) | x_i(-j))}{p(x_i) p(x^*(j)|x^*(-j))}\right\} = \min\left\{1, \frac{p(x^*(-j))}{p(x_i(-j))}\right\} = 1

Hence, the Gibbs sampling process can be briefly described as a procedure that draws a value of x at each time from the conditional distribution of that value depending on the other values of x.

If the dependency between the values of x forms a directed acyclic graph (DAG), we can start the updating from the nodes without parents, and then update the children of these nodes, and so on. If the dependency forms a cyclic graph, we need to repeatedly update the values in random order or any specific order until convergence, which is called burn in.

Personal tools