04 Exponential families

Exponential family of distributions

Generated by h and feature function, parametrized by t

\(x \in X \subseteq R^{d}\); base measure \(h: R^{d} \to R\), feature extraction fn \(\ftr: R^{d} \to R^{k}\). \(p(x; t) \propto h(x)e^{\dprod{t,\ftr(x)}} \). Exponential family: \(\set{p(x; t): t}\). The \(\ftr_i()\) are aka potential functions.

Can be written as \(p(x; t) = h(x)e^{\dprod{t,\ftr(x)} - G(t)} = Z^{-1}h(x)e^{\dprod{t,\ftr(x)}}\), where \(G(t) = \log \sum_{x \in X}h(x)e^{\dprod{t,\ftr(x)}}\) is the log partition fn.
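Example: the Bernoulli distribution \(p(x; \theta) = \theta^{x}(1-\theta)^{1-x}\) on \(X = \set{0,1}\) is a member: take \(h(x) = 1\), \(\ftr(x) = x\), \(t = \log\frac{\theta}{1-\theta}\); then \(G(t) = \log(1 + e^{t})\).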

\(G(t)\) may not be finite for every t \why, so define the natural parameter space: \(N = \set{t \in R^{k}: -\infty < G(t) < \infty}\).

Canonical form.

The natural parametrization is non-unique: e.g., one can rescale \(t \to kt\) while rescaling \(\ftr \to k^{-1}\ftr\) without changing the distribution. The natural parameter space is convex: \(G\) is convex in \(t\) (by Hölder's inequality), so \(\set{t: G(t) < \infty}\) is convex.

Minimal parametrization.

If the features \(\ftr_i(x)\) are not linearly independent, the exponential family is overparametrized: the same probability distribution can be expressed using multiple distinct parameter vectors.

Undirected graphical model from exp family distribution

Let \(h(x) = 1\). \(e^{\dprod{t,\ftr(x)}} = \prod_i e^{t_i \ftr_i(x)}\): make cliques among the variables involved in each \(\ftr_i(x)\).
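E.g., \(x \in \set{-1,1}^{n}\) with pairwise features \(\ftr_{ij}(x) = x_{i}x_{j}\) yields the Ising model: an edge \(\set{i,j}\) for every nonzero \(t_{ij}\).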

Maximum entropy distribution with given means

The optimization problem

\(\max_p H(p): E_{x \distr p}[\ftr(x)] = \mean, p \geq 0, 1^{T}p = 1\).

Equivalently, can use \(H(p) = \log|dom(X)| - KL(p, U)\) in the objective: so finding the p closest to the uniform distribution U with the given means.

Lagrangian form

\(\max_p \sum_x p(x)\log(\frac{1}{p(x)}) + \sum_i t_i (E_{x \distr p}[\ftr_i(x)] - \mean_i): p \geq 0, 1^{T}p = 1\); reduce it to \(\max_p -KL(p, p^{*}) + \log Z\), where \(p^{*}(x) = Z^{-1}e^{\sum_{i}t_{i}(\ftr_i(x) - \mean_i)}\); so the maximum is attained at \(p = p^{*}\).

So, the solution is a member of the exponential family generated by the base measure U and feature functions \(\ftr()\).
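A minimal numerical sketch of this (assuming numpy and scipy; the toy domain and feature are made up for illustration): at \(p = p^{*}\) the Lagrangian equals \(\log Z = G(t) - \dprod{t, \mean}\), so the constrained problem reduces to minimizing this convex dual over t.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy instance: domain X = {0,1,2,3}, one feature ftr(x) = x,
# prescribed mean mu = 1.2. Solution should be p(x) ∝ e^{t·x}.
xs = np.arange(4.0)
ftr, mu = xs, 1.2

def G(t):
    # log partition fn with h(x) = 1: log sum_x exp(t * ftr(x))
    return np.log(np.sum(np.exp(t * ftr)))

# Dual: minimize G(t) - t*mu (convex in t); its gradient is E_p[ftr] - mu.
t_star = minimize(lambda t: G(t[0]) - t[0] * mu, x0=[0.0]).x[0]
p = np.exp(t_star * ftr - G(t_star))
print(p.sum(), p @ ftr)  # -> 1.0 and ≈ 1.2: the mean constraint holds
```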

Closest distribution to h with given means

Solve \(\min_p KL(p, h): E_{x \distr p}[\ftr(x)] = \mean, p \geq 0, 1^{T}p = 1\) similarly to show that the solution belongs to the exponential family generated by h and \(\ftr\).

Parametrization by means

t then corresponds to the Lagrange multipliers, whose values depend on \(\mean\); so a distribution in the family can equivalently be parametrized by the means \(\mean\).

Polytope of means

The set of all possible means forms a polytope; and finding a distribution from an exponential family is then often viewed as finding a point \(\mean\) in this polytope: see statistics ref.

Any \(\mean\) in this polytope corresponds to some p.
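E.g., for the Bernoulli family with \(\ftr(x) = x\), the polytope of means is the interval \([0, 1]\).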

Inverse Exponential decay for squared deviation from mean

Aka Normal distribution.

Importance

The ‘bell curve’ is often observed in nature. The Normal distribution happens to be a suitable light-tailed model for such phenomena.

It is also important due to the Central Limit Theorem: the (suitably rescaled) sample mean approaches the normal distribution as the number of samples increases.

1D case

pdf, cdf

\(X \sim N[\mean,\stddev^{2}]\): a location parameter and a scale parameter. Range(X) = R. Probability density (Gaussian) \ \(N[x | \mean,\stddev^{2}] = f_{\mean, \stddev^{2}}(X=x) = \frac{1}{\stddev \sqrt{2\pi}}exp(-\frac{(x-\mean)^{2}}{2\stddev^{2}})\).

Defined this way to ensure symmetry about \(\mean\), the mean: the bell curve; inverse exponential decay away from the mean;\ the \(\frac{1}{\stddev \sqrt{2\pi}}\) factor ensures that \(\int_{-\infty}^{\infty} N[x | \mean,\stddev^{2}]dx = 1\) (aka Normalization), using the Gaussian integral.

Thence confirm using direct integration: \(\int N[x | \mean,\stddev^{2}] x \, dx = \mean\). Also, using integration by parts, \(var[X] = E[X^{2}]-E[X]^{2} = \stddev^{2}\). Mode coincides with the mean.

Important probabilities

\(Pr(|X - \mean| \leq \stddev) \approx .68 \).

\(Pr(|X - \mean| \leq 2\stddev) \approx .95 \).

\(Pr(|X - \mean| \leq 3\stddev) \approx .997 \).
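These can be checked numerically; a quick sketch, assuming scipy:

```python
from scipy.stats import norm

# Pr(|X - mean| <= k*stddev) is the same for every mean, stddev; use N(0,1).
for k in (1, 2, 3):
    print(k, norm.cdf(k) - norm.cdf(-k))  # ≈ 0.6827, 0.9545, 0.9973
```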

Standard Normal distribution

\(N(x|0,1) = \frac{1}{\sqrt{2 \pi}}exp(-\frac{x^{2}}{2})\). Aka Z distribution. Very convenient as every normal distribution can be viewed as a standard normal rescaled and shifted.

CDF calculation

CDF \(F(x) = \int_{-\infty}^{x} f(u)du\) looks like a sigmoid curve, but has no closed form. So, to calculate the CDF, convert to the standard normal distribution and look up the corresponding entry in a table. So, Inverse Exponential tail bounds hold. \chk

Often convert \(X \distr N(\mean, \stddev^{2}) \to (\frac{X - \mean }{\stddev}) \distr N(0, 1)\) to use N(0, 1) CDF table to find CDF of X.

In Matlab, can use erfc and erfcinv. \(F(x) = \frac{1}{2} erfc(-z/\sqrt{2})\), where \(z = (x - \mean)/\stddev\).
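The same formula works outside Matlab; a minimal Python sketch using the standard library's erfc (normal_cdf is a made-up helper name):

```python
import math

def normal_cdf(x, mean=0.0, stddev=1.0):
    # F(x) = 1/2 erfc(-z/sqrt(2)), with z the standardized value
    z = (x - mean) / stddev
    return 0.5 * math.erfc(-z / math.sqrt(2))

print(normal_cdf(1.0))            # ≈ 0.8413
print(normal_cdf(2.0, 1.0, 1.0))  # same value: shift/scale reduces to N(0,1)
```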

Moment generating function

$$M(t) = E[e^{tX}] = \int e^{tx}\frac{1}{\stddev \sqrt{2\pi}}exp(-\frac{(x-\mean)^{2}}{2\stddev^{2}}) dx \ = \int e^{t\mean + \stddev^{2}t^{2}/2}\frac{1}{\stddev \sqrt{2\pi}}exp(-\frac{(x-\mean - \stddev^{2}t)^{2}}{2\stddev^{2}}) dx \ = e^{t\mean + \stddev^{2}t^{2}/2}$$ by completing the squares.
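The square-completion step used above: \(tx - \frac{(x-\mean)^{2}}{2\stddev^{2}} = t\mean + \frac{\stddev^{2}t^{2}}{2} - \frac{(x - \mean - \stddev^{2}t)^{2}}{2\stddev^{2}}\).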

Other properties

If \(\set{X_{i}}\) are iid \(N(\mean, \stddev^{2})\), \ using mgf, \(\sum a_{i}X_{i} \distr N(\sum a_{i}\mean, \sum a_{i}^{2}\stddev^{2})\).

A log-concave density is qualitatively close to a Normal: its log is concave, so it is unimodal, with tails decaying at least exponentially: just visualize.

If \(X \sim N[0, 1]\), \(f_{0,1}\) is an eigenfunction of the Fourier transform. \ Also, \(E[e^{sX^{2}}] = (1-2s)^{-0.5}\) for \(s < 1/2\) \why.

Multidimensional case

Definition with univariates

\(X \in R^{n}\) has multidimensional normal distribution if \(\forall a \in R^{n}: a^{T}X\) has a univariate normal distribution.

It is determined completely by \(\mean\) and the covariance matrix \(\covmatrix\), as for any random vector X, \(a^{T}X\) has mean \(a^{T}\mean\) and variance \(a^{T}\covmatrix a\).

So, any subvector Y also has multivariate normal. \exclaim{So, all marginals are normal!}

Distribution

\(x \in R^{D}\). Suppose \(\covmatrix \succ 0\).

\(N(x|\mean, \covmatrix) = \frac{1}{(2\pi)^{D/2} |\covmatrix|^{1/2}} e^{-\frac{1}{2}(x - \mean)^{T}\covmatrix^{-1}(x-\mean)}\). Parameters: \(\covmatrix, \mean\).
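A quick numerical sanity check of this density formula against scipy (the array values are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

mean = np.array([0.0, 1.0])
cov = np.array([[2.0, 0.3],
                [0.3, 1.0]])          # symmetric, positive definite
x = np.array([0.5, 0.5])
d, D = x - mean, len(x)

# direct evaluation of N(x | mean, cov)
dens = np.exp(-0.5 * d @ np.linalg.solve(cov, d)) \
       / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(cov)))
print(dens, multivariate_normal(mean, cov).pdf(x))  # should match
```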

Reformulations

\(tr((x - \mean)^{T}\covmatrix^{-1}(x-\mean)) = tr(\covmatrix^{-1}(x-\mean)(x - \mean)^{T})\).

Also often written as \(\propto e^{-2^{-1}x^{T}Px + \mean^{T}Px} = e^{-2^{-1}x^{T}Ax + b^{T}x}\), with \(P = \covmatrix^{-1}\), \(A = P\), \(b = P\mean\). The mode-finding problem is then, given \(A, b\), to find \(\mean = A^{-1}b\): see Gaussian graphical model inference.

Singular covariance matrix

If the covariance matrix is singular, the expression is ill-defined. But one can consider a lower-dimensional distribution embedded in a higher-dimensional space.

Covariance matrix is symmetric

If \(\covmatrix\) is a covariance matrix, it must be symmetric. Also, since \(x^{T}Mx = x^{T}(\frac{M + M^{T}}{2})x\) for any \(M\), the \(\covmatrix^{-1}\) in the exponent can be taken to be symmetric; thence \(\covmatrix = \covmatrix^{T}\) assumable.

If \(\covmatrix \succeq 0\): \(\covmatrix = U\EW U^{*}, |\covmatrix|^{1/2} = \prod \ew_{i}^{1/2}\).

Geometric view

Take a level set \(N(x|\mean, \covmatrix) = c\); \ get \(\gD = (x - \mean)^{T}\covmatrix^{-1}(x-\mean) = (x - \mean)^{T}U\EW^{-1}U^{*}(x-\mean) = c'\): a hyper-ellipse, with the \(u_{i}\) as major axes, \(\ew_{i}^{1/2}\) as (relative) radii, \(\mean\) as center: see topology ref.

\(\gD\) is the Mahalanobis distance.

As a product of univariate normal distributions

So, take \(y = U^{*}(x-\mean)\) as the new axes. Then $$N(x|\mean, \covmatrix) = \ \frac{1}{(2\pi)^{D/2} \prod_j \ew_{j}^{1/2}} e^{-\frac{1}{2} \sum_j \frac{y_{j}^{2}}{\ew_{j}}}$$: thus factored into a product of univariate normal distributions in the variables \(y_{j}\).
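A small numerical illustration of this factorization (assuming numpy; covariance and sample size arbitrary): rotating samples into the eigenbasis decorrelates the coordinates.

```python
import numpy as np

cov = np.array([[2.0, 0.8],
                [0.8, 1.0]])
ew, U = np.linalg.eigh(cov)          # cov = U diag(ew) U^T
rng = np.random.default_rng(0)
x = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)
y = x @ U                            # per sample: y = U^T (x - mean), mean = 0
print(np.cov(y.T).round(2))          # ≈ diag(ew): independent 1D normals
```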

A special case

If \((X_{i})\) are \(\perp\), \(\covmatrix\) is diagonal, then this boils down to product of univariate normal distributions, as expected.

Product of normal distributions

\(N(\mean_1, \covmatrix_1 = P_1^{-1}) \cdot N(\mean_2, \covmatrix_2 = P_2^{-1}) \propto N(\mean, \covmatrix)\) with \(\covmatrix^{-1} = P = P_1 + P_2\), \(\mean^{T} = (\mean_1^{T}P_1 + \mean_2^{T}P_2)P^{-1}\): consider the form of the exponent: \(\sum_i (-2^{-1} x^{T}P_ix + \mean_i^{T}P_ix)\).

\(\infty\) dimensional Normal distribution

Aka Gaussian process. An \(\infty\) dimensional distribution whose marginal on every finite set of coordinates is a finite dimensional Normal distribution. A Gaussian distribution on functions.

Gaussian graphical models

Conditionally independent variables

Take the graphical model graph G (vertex set V) corresponding to the multidimensional normal distribution. Take the precision matrix \(P = \covmatrix^{-1}\). \(P_{i,j} = 0\) corresponds to the conditional independence \(X_i \perp X_j| X_{V - \set{i,j}}\) in G, and to the factorization \(f_X(x) = \prod_{i,j: P_{i,j} \neq 0} f_{i,j}(x_i, x_j)\).

Inference in this graphical model is often interesting: it is equivalent to solving \(Ax = b\) for symmetric \(A\).
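A minimal sketch of that equivalence (assuming numpy/scipy; the chain-graph precision matrix is a made-up example): computing the mean/mode amounts to solving \(P\mean = b\), and conjugate gradients suits the symmetric positive definite \(P\).

```python
import numpy as np
from scipy.sparse.linalg import cg

P = np.array([[ 2.0, -1.0,  0.0],    # zeros encode missing edges
              [-1.0,  2.0, -1.0],    # (a chain graph: 1 - 2 - 3)
              [ 0.0, -1.0,  2.0]])
b = np.array([1.0, 0.0, 1.0])

mu, info = cg(P, b)                   # iterative solve, exploits sparsity
print(mu, np.linalg.solve(P, b))      # both give the mean vector
```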

1D distributions from exponential family

Polynomial rise with inverse exponential decay for largeness

Aka Gamma distribution gamma(a, b).

Models time to complete some task.

pdf \(f_X(x) = \frac{b^{-a}x^{a-1}e^{-x/b}}{\gG(a)} \propto x^{a-1}e^{-x/b}\) for \(x \in [0, \infty)\): Pf: using the \(\gG()\) defn. Mean \(\mean = ab; var[X] = ab^{2}\) using \(\gG()\) properties. By finding the critical point of f(x), the mode is b(a-1) (for \(a \geq 1\)).

This is a single mode distribution.
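A numeric check of these moments, assuming scipy (whose gamma uses the same shape \(a\), scale \(b\) convention):

```python
import numpy as np
from scipy.stats import gamma

a, b = 3.0, 2.0
dist = gamma(a, scale=b)
print(dist.mean(), dist.var())       # ab = 6.0, ab^2 = 12.0
xs = np.linspace(0.01, 20, 4000)
print(xs[np.argmax(dist.pdf(xs))])   # ≈ b(a-1) = 4: the single mode
```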

Inverse Gamma distribution

The distribution of U = 1/X, where \(X \distr gamma(a, b)\).

This is again a single mode distribution.

Exponential decay distribution \(expo(\mean)\)

pdf: \(f(x; \mean) = \mean^{-1}e^{-x/\mean}\) for \(x \in [0, \infty)\). Same as \(gamma(1, \mean)\).

CDF \(F(x) = 1 - e^{-x/\mean}\) for \(x \geq 0\).

\(\mean\) is the mean, by integration by parts.

\(var[X] = \mean^{2}\): use integration by parts to find \(E[X^{2}]\); then \(var[X] = E[X^{2}] - \mean^{2}\).
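A quick scipy check of the CDF and moments (scale = mean in scipy's convention):

```python
import math
from scipy.stats import expon

m = 2.0
dist = expon(scale=m)                     # expo(m)
print(dist.cdf(1.0), 1 - math.exp(-1/m))  # CDF formula agrees
print(dist.mean(), dist.var())            # m, m^2
```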

Shift parameter

Note that analogous distributions \(f(x; t, \mean)\) can be defined with support \(x \in [t, \infty)\) by adding a location parameter \(t\).

Bilateral-exponential decay distribution

Aka double-exponential decay distribution. Aka Laplace distribution.

pdf \(f(x; t, m) = \frac{1}{2m} e^{-|x - t|/m}\).

The graph

Mode is \(t\). You get an exponential tail on both sides. \(\frac{1}{m}\) controls how fast the exponential decay kicks in. Aka exponential clock.