Sampling a random variable X
Problem
The pseudorandom number generator yields a sequence of almost independent random bits: see randomized alg ref. How do you use these bits to sample from a given distribution?
Visualization
Want the sampling to ‘cover’ the entire range(X): visualize it as throwing darts into an oval: the dart density corresponds to the probability, and the shape formed by the darts corresponds to the Pr(X) contour.
Challenges
Can’t sample from \(\Re\): a computer can’t even store arbitrary elements of \(\Re\). So, one must sample from Q.
Or, Pr(X=x) or \(Pr(X=x|Y=y)\) may be difficult to calculate.
Sampling some distributions
Uniform distribution
Sampling from U[a, b] is easy: sample the binary expansion of a point in the range one bit at a time.
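A minimal sketch in Python, assuming random.getrandbits supplies the random bits: build the binary expansion of a point in [0, 1) bit by bit, then rescale to [a, b].

\begin{verbatim}
import random

def uniform(a, b, bits=53):
    """Sample approximately uniformly from [a, b] using `bits` random bits."""
    u = 0.0
    for i in range(1, bits + 1):
        # i-th bit of the binary expansion of a point in [0, 1)
        u += random.getrandbits(1) * 2.0 ** (-i)
    return a + (b - a) * u
\end{verbatim}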
Discrete probability distribution
Take the line segment [0, 1] and partition it into ranges \(\set{R_i}\) whose lengths are the probabilities of the various values; sample x from U[0, 1]; identify the range \(R_j\) containing x; output the value corresponding to \(R_j\).
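A minimal sketch of this partition-of-[0, 1] sampler, assuming Python and that the values and their probabilities are given as lists:

\begin{verbatim}
import random

def sample_discrete(values, probs):
    """Sample one of `values`; probs[j] is the length of range R_j in [0, 1]."""
    x = random.random()          # x ~ U[0, 1)
    cumulative = 0.0
    for value, p in zip(values, probs):
        cumulative += p          # right endpoint of the current range R_j
        if x < cumulative:
            return value
    return values[-1]            # guard against floating-point round-off

# e.g. a biased three-valued variable:
# sample_discrete(['a', 'b', 'c'], [0.5, 0.3, 0.2])
\end{verbatim}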
Normal distribution
Use rejection sampling: accept x drawn from a uniform distribution with probability proportional to \(e^{-\frac{1}{2}(x-\mean)^{T}\covmatrix^{-1}(x-\mean)}\) (sketched below). One can similarly sample from any distribution whose density can be evaluated up to a constant.
Or, sample from a discrete probability distribution that approximates the normal: but this is costly, as the number of rational points needed in the range is large.
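A minimal sketch of the rejection idea above for a 1-dimensional normal, assuming Python; the uniform proposal is truncated to [mean - 5 sd, mean + 5 sd], an illustrative choice (the mass outside is negligible):

\begin{verbatim}
import math
import random

def rejection_sample_normal(mean=0.0, sd=1.0):
    """Sample (approximately) from N(mean, sd^2) by rejection sampling."""
    lo, hi = mean - 5 * sd, mean + 5 * sd
    while True:
        x = random.uniform(lo, hi)                 # proposal: uniform on [lo, hi]
        # accept with probability proportional to the unnormalized density
        accept_prob = math.exp(-(x - mean) ** 2 / (2 * sd ** 2))
        if random.random() < accept_prob:
            return x
\end{verbatim}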
Multidimensional Normal distribution
Sampling from a spherical normal distribution is easy: each coordinate is an independent 1-dimensional normal. To sample from a general multidimensional normal, just apply a suitable linear transformation (e.g. a square root of \(\covmatrix\)) to spherical normal samples.
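A minimal sketch, assuming Python with numpy: draw a spherical normal sample z coordinate-wise and map it through a Cholesky factor L of \(\covmatrix\), so that \(x = \mean + Lz\) has covariance \(LL^{T} = \covmatrix\).

\begin{verbatim}
import numpy as np

def sample_mvn(mean, cov, rng=None):
    """Sample from N(mean, cov) via a spherical normal plus a linear map."""
    rng = rng if rng is not None else np.random.default_rng()
    mean = np.asarray(mean, dtype=float)
    L = np.linalg.cholesky(np.asarray(cov, dtype=float))  # cov = L L^T
    z = rng.standard_normal(mean.shape[0])                # z ~ N(0, I)
    return mean + L @ z

# e.g. sample_mvn([0.0, 1.0], [[2.0, 0.3], [0.3, 1.0]])
\end{verbatim}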
Problems solved by sampling X
The integration/ summation problem
Maybe you want to find \(E[f(X)|T]\), where T is an event (maybe the trivial event ‘true’), but you find it difficult to do this analytically. The expectation could involve a combinatorial sum: the RV f(X) can take on many different values with different probabilities.
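A minimal sketch of the Monte Carlo estimate, assuming Python and that we can sample X and test the event T (the names sample_x, f, holds_T are illustrative); draws where T fails are simply discarded:

\begin{verbatim}
import random

def mc_conditional_expectation(sample_x, f, holds_T, n=100_000):
    """Estimate E[f(X) | T]: average f over samples for which T holds."""
    total, kept = 0.0, 0
    for _ in range(n):
        x = sample_x()
        if holds_T(x):
            total += f(x)
            kept += 1
    return total / kept if kept else float('nan')

# e.g. expected value of a die roll given that the roll is even (true answer 4):
# mc_conditional_expectation(lambda: random.randint(1, 6),
#                            f=lambda x: x, holds_T=lambda x: x % 2 == 0)
\end{verbatim}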
Find normalization const Z of Pr(X)
Given a probability distribution \(Pr(x) = Z^{-1}f(x)\) specified only by f(x), find Z.
Finding Pr(T) for hard to analyze T
Same as finding the expectation of an indicator random variable.
Visualize sample space as an oval, and sampling as putting dots in the oval; you estimate the size of a smaller oval where the event is true.
Aka counting if every sample point is equally probable.
Optimization
Sampling with special attention to finding an optimum is considered elsewhere.
Maybe you want to find \(\min_{X} f(X)\). As seen in optimization ref, in general it is hard to find a global minimum; so repeated sampling / random restarts with gradient descent can be a useful strategy.
Optimization after summation
Maybe you want \(\min_{X} E[f(X)|T]\).
Simulation
Maybe you want to simulate the roll of dice, or want AIs that behave realistically in games.
Sampling Random vectors
Arbitrary random vectors
Suppose the random vector \(X\) is 3-dimensional, and suppose that we are given the joint potential \(\gf_X\). One can first deduce \(f_{X_1}\) and sample \(x_1\) from it, then \(x_2\) from \(f_{X_2|X_1 = x_1}\), and finally \(x_3\) from \(f_{X_3|X_1 = x_1, X_2 = x_2}\). The cost mostly comes from having to sum over \(|ran(X_2)||ran(X_3)|\) terms in deducing these distributions.
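A minimal sketch of this chain-rule sampling, assuming Python, finite ranges for the \(X_i\), and an unnormalized joint potential given as a function; the marginals and conditionals are deduced by brute-force summation, exactly the cost noted above:

\begin{verbatim}
import itertools
import random

def chain_rule_sample(potential, ranges):
    """Sample x = (x_1, ..., x_n) from the distribution proportional to potential.

    ranges[i] is the finite range of X_i. X_1 is drawn from its marginal,
    then X_2 from its conditional given x_1, and so on; each (unnormalized)
    marginal is obtained by summing potential over the remaining coordinates.
    """
    x = []
    for i in range(len(ranges)):
        weights = []
        for v in ranges[i]:
            w = sum(potential(tuple(x) + (v,) + rest)
                    for rest in itertools.product(*ranges[i + 1:]))
            weights.append(w)
        # inverse-CDF draw over the finite range, as in the discrete sampler above
        u, cum = random.random() * sum(weights), 0.0
        for v, w in zip(ranges[i], weights):
            cum += w
            if u < cum:
                x.append(v)
                break
        else:
            x.append(ranges[i][-1])
    return tuple(x)
\end{verbatim}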
From graphical models
Forward sampling for DAGs
Consider a directed graphical model with fully specified conditional probability distributions for non-root nodes and distributions for root nodes.
One simply samples each level of the DAG in succession, starting at level 0, at each step using the conditional probability distribution of each node given the samples already drawn for its parents in the previous levels.
This is an exact sampling technique.
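A minimal sketch of forward sampling, assuming Python and a toy two-node DAG Rain -> WetGrass with illustrative probabilities; each node is drawn given the samples already obtained for its parents:

\begin{verbatim}
import random

def bernoulli(p):
    return random.random() < p

def forward_sample():
    """Forward sampling for the toy DAG Rain -> WetGrass."""
    rain = bernoulli(0.2)                        # root node: Pr(Rain)
    p_wet_given_rain = {True: 0.9, False: 0.1}   # CPT for WetGrass | Rain
    wet = bernoulli(p_wet_given_rain[rain])      # child given its sampled parent
    return {'Rain': rain, 'WetGrass': wet}
\end{verbatim}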
Undirected trees
Sampling from undirected tree-structured graphical models can be accomplished by efficiently converting them to tree-structured directed graphical models. Computing marginal probabilities by summing over pairwise potentials is efficient when \(range(X_i)\) is finite, which in turn makes computing the conditional probabilities for the resulting DAG efficient.
Factor trees
One can forward-sample tree-structured factor graphs by converting them to equivalent directed trees whose nodes correspond to factor nodes.
Arbitrary undirected graphs
Arbitrary undirected graphs may be sampled by conversion to corresponding factor trees.
Repeated Sampling
Aka Monte Carlo sampling. Repeated sampling is done for several purposes, like inference. It is usually much simpler when Pr(X) is modelled by a graphical model.
Here we consider various techniques for repeated sampling themselves, rather than their special applications to problems such as inference (as in rejection sampling).
Markov chain Stationary distribution
Aka Markov Chain Monte Carlo (MCMC).
Consider a stochastic process whose state space is \(ran(X)\). For any distribution \(f_X\), we can usually construct a Markov chain whose stationary distribution is \(f_X\). So, by doing an appropriate random walk on the state transition graph of this Markov chain (see probabilistic models ref for details), one can sample states from \(f_X\).
But before each sample is picked, one must ‘throw away’ some steps (burn-in) in order to let the state distribution approach the stationary distribution.
The art is in finding a fast-mixing chain. If the chain is time-reversible, one can get a very fast-mixing Markov chain.
Sampling from discrete distribution
Aka the Metropolis algorithm.
Want to sample from a distribution over \(range(X) = \set{v_{i}}\). Construct an ergodic Markov chain with the right stationary distribution: pick \(M \geq maxDegree\), set \(Pr(v_{i} \to v_{j}) = (1/M)\min(1, \pi_{v_{j}}/\pi_{v_{i}})\) for each neighbor \(v_j\), and put the remaining probability on a self-loop. Note that finding \(\pi_{v_{j}}/\pi_{v_{i}}\) is usually easy: you don’t need to know the normalization constant Z.
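A minimal sketch of this chain, assuming Python, the complete graph on n states (so M = n is a valid bound), and weights given only up to the normalization constant Z:

\begin{verbatim}
import random

def metropolis_sample(weights, steps=10_000):
    """Run the Metropolis chain over states 0..n-1 and return the final state.

    weights[i] is proportional to pi_i; only ratios pi_j / pi_i are used,
    so Z is never needed. With M = n, Pr(i -> j) = (1/M) min(1, pi_j / pi_i)
    for j != i; the remaining probability is a self-loop.
    """
    n = len(weights)
    state = random.randrange(n)       # arbitrary start state
    for _ in range(steps):            # steps also serves as burn-in
        j = random.randrange(n)       # pick one of the M = n slots uniformly
        if j != state and random.random() < min(1.0, weights[j] / weights[state]):
            state = j                 # accept the move
        # otherwise: self-loop
    return state

# e.g. with weights [1, 2, 7], repeated calls return state 2 about 70% of the time
\end{verbatim}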
Repeated conditional sampling
Aka Gibbs sampling.
Algorithm
Start with a random assignment to \((X_i)\). For each i, sample \(X_i \distr Pr(X_i|X_{j\neq i})\); repeat.
The conditional probability distribution is computed using \(f_{X_i|X_{-i}} = \gf_{X}/\gf_{X_{-i}} = \gf_{X}/\sum_{x_i} \gf_{X}\), where the joint potential \(\gf_X\) is not necessarily normalized.
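A minimal sketch of these sweeps, assuming Python, finite ranges, and an unnormalized joint potential \(\gf_X\) given as a function; the conditional for \(X_i\) is obtained by evaluating the potential at every value of \(X_i\) with the other coordinates held fixed, so Z cancels out:

\begin{verbatim}
import random

def gibbs_sweeps(potential, ranges, sweeps=1_000):
    """Gibbs sampling from the distribution proportional to potential.

    ranges[i] is the finite range of X_i. Each sweep resamples every X_i
    from Pr(X_i | X_{-i}), which is proportional to potential(x) with the
    other coordinates fixed. Returns the assignment after the last sweep.
    """
    x = [random.choice(r) for r in ranges]   # random initial assignment
    for _ in range(sweeps):
        for i, r in enumerate(ranges):
            weights = []
            for v in r:                      # unnormalized conditional weights
                x[i] = v
                weights.append(potential(tuple(x)))
            u, cum = random.random() * sum(weights), 0.0
            for v, w in zip(r, weights):
                cum += w
                if u < cum:
                    x[i] = v
                    break
            else:
                x[i] = r[-1]
    return tuple(x)
\end{verbatim}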
Correctness
Why is this the same as sampling from Pr(X)? \why
Efficiency
Gibbs sampling is efficient as long as \(\gf_X\) can be efficiently computed. This is because the normalization constant cancels out and need not be computed, and because \(range(X_i)\) is small.
Modifications
This behaves like a random walk on a graph, so it often takes long to cover the entire sample space well. Consider \((X_1, X_2)\) that are highly correlated: visualize the sample space as an elongated ellipse around some line; moving L units along \(X_i\) then requires \(O(L^{2})\) steps.
So, maybe use random restarts.
Block conditional sampling
Aka block-Gibbs sampling.
Algorithm
Here, we sample whole blocks of variables at a time. Consider a partition of the variables \(X\) into \(A, B\). Starting with a random assignment, we sample from \(f_{A|B}\) and \(f_{B|A}\) in turn, where \(f_{A|B} = \gf_X/\gf_B\) and \(\gf\) is the potential function, which is not necessarily normalized.
Finding \(\gf_B(b) = \sum_a \gf_X(a, b)\) requires summing over \(\prod_i |ran(A_i)|\) terms in general, which is not efficient.
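A minimal sketch, assuming Python and blocks small enough that \(f_{A|B}\) can be sampled by brute-force enumeration of the block’s joint range (in practice one exploits structure, e.g. tree-structured subgraphs, instead):

\begin{verbatim}
import itertools
import random

def block_gibbs_sweeps(potential, ranges, blocks, sweeps=1_000):
    """Block-Gibbs sampling from the distribution proportional to potential.

    blocks is a partition of the coordinate indices, e.g. [(0, 1), (2,)].
    Each step resamples one whole block jointly from its conditional given
    the rest, here by enumerating the block's joint range.
    """
    x = [random.choice(r) for r in ranges]   # random initial assignment
    for _ in range(sweeps):
        for block in blocks:
            candidates = list(itertools.product(*(ranges[i] for i in block)))
            weights = []
            for values in candidates:        # joint values of the block
                for i, v in zip(block, values):
                    x[i] = v
                weights.append(potential(tuple(x)))
            u, cum = random.random() * sum(weights), 0.0
            for values, w in zip(candidates, weights):
                cum += w
                if u < cum:
                    for i, v in zip(block, values):
                        x[i] = v
                    break
    return tuple(x)
\end{verbatim}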
Advantages
It is better than Gibbs sampling when the subgraphs can be more efficiently or accurately (perhaps even exactly) sampled from, and when summing over them is easy. This is true for example when the partitions correspond to tree-structured subgraphs of directed graphical models.
Physics-inspired sampling
Energy temperature analogy
Can write \(Pr(x) = Z^{-1}e^{-\frac{E(x)}{T}}\), where E is the energy function and T is the temperature. \tbc
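For instance, any strictly positive unnormalized \(f(x)\) with \(Pr(x) = Z^{-1}f(x)\) fits this form by choosing \(E(x) = -T\log f(x)\), since
\[
Z^{-1}e^{-E(x)/T} = Z^{-1}e^{\log f(x)} = Z^{-1}f(x) = Pr(x),
\qquad Z = \sum_{x} e^{-E(x)/T} = \sum_{x} f(x).
\]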