Exponential family of distributions
Generated by h and feature function, parametrized by t
\(x \in X \subseteq R^{d}\); Base measure \(h: R^{d} \to R\), feature extraction fn \(\ftr: R^{d} \to R^{k}\). \(p(x; t) \propto h(x)e^{\dprod{t,\ftr(x)}} \). Exponential family: \(\set{p(x; t) \forall t}\). \(\ftr_i()\) aka potential functions.
Can be written as \(p(x; t) = h(x)e^{\dprod{t,\ftr(x)} - G(t)} = Z^{-1}h(x)e^{\dprod{t,\ftr(x)}}\), where \(G(t) = \log \sum_{X}h(x)e^{\dprod{t,\ftr(x)}}\) is the log partition fn.
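A minimal numerical sketch of these definitions, assuming numpy/scipy, a small discrete domain, and illustrative choices of h and \(\ftr\):

```python
import numpy as np
from scipy.special import logsumexp

# Discrete domain X = {0, 1, ..., 9}; h and ftr are illustrative choices.
X = np.arange(10)
h = np.ones_like(X, dtype=float)   # base measure h(x) = 1
ftr = np.stack([X, X**2]).T        # feature fn ftr(x) = (x, x^2), so k = 2

def log_partition(t):
    """G(t) = log sum_x h(x) e^{<t, ftr(x)>}, computed stably."""
    return logsumexp(np.log(h) + ftr @ t)

def density(t):
    """p(x; t) = h(x) e^{<t, ftr(x)> - G(t)}."""
    return h * np.exp(ftr @ t - log_partition(t))

t = np.array([0.5, -0.1])
p = density(t)
assert np.isclose(p.sum(), 1.0)    # normalized by construction
```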
\(G(t)\) may not be finite for every t \why, so define the natural parameter space: \(N = \set{t \in R^{k}: -\infty < G(t) < \infty}\).
Canonical form.
In canonical form, the parameter enters linearly via \(\dprod{t, \ftr(x)}\). This choice is non-unique: e.g., rescaling the natural parameters as \(j(t) = kt\) (while rescaling the features by \(1/k\)) yields the same family for any \(k \neq 0\). The natural parameter space is convex. \why
Minimal parametrization.
If the features \(\ftr_i(x)\) are not linearly independent, the exponential family is overparametrized: the same probability distribution can be expressed using multiple distinct parameter vectors. A parametrization with linearly independent features is called minimal.
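A one-line illustration (a binary variable with redundant features \(\ftr_1(x) = x, \ftr_2(x) = 1 - x\), so \(\ftr_1 + \ftr_2 = 1\)):
$$\dprod{t + (c, c), \ftr(x)} = \dprod{t, \ftr(x)} + c \quad \forall x,$$
so the constant \(c\) is absorbed into \(G(t)\), and \(t\) and \(t + (c, c)\) yield the same distribution.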
Undirected graphical model from exp family distribution
Let \(h(x) = 1\). \(e^{\dprod{t,\ftr(x)}} = \prod_i e^{t_i \ftr_i(x)}\): make cliques among the variables involved in each \(\ftr_i(x)\).
Maximum entropy distribution with given means
The optimization problem
\(\max_p H(x): E_{x \distr p}[\ftr(x)] = \mean, p \geq 0, 1^{T}p = 1\).
Equivalently, can use \(H(x) = \log|dom(X)| - KL(p, U)\) in the objective, with U the uniform distribution: so we are finding the p closest to U with the given means.
Lagrangian form
\(\max_p \sum_x p(x)\log\frac{1}{p(x)} + \sum_i t_i (E_{x \distr p}[\ftr_i(x)] - \mean_i)\) subject to \(p \geq 0, 1^{T}p = 1\); reduce it to \(\max_p -KL(p, p^{*}) + \log Z\), where \(p^{*}(x) = Z^{-1}e^{\sum_{i}t_{i}(\ftr_i(x) - \mean_i)}\).
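Spelling out the reduction:
$$-KL(p, p^{*}) + \log Z = \sum_x p(x)\log\frac{p^{*}(x)}{p(x)} + \log Z = H(p) + \sum_i t_i\left(E_{x \distr p}[\ftr_i(x)] - \mean_i\right),$$
using \(\log p^{*}(x) = \sum_i t_i(\ftr_i(x) - \mean_i) - \log Z\); maximizing over p forces \(p = p^{*}\).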
So, the solution is a member of the exponential family generated by the base measure U and feature functions \(\ftr()\).
Closest distribution to h with given means
Solve \(\min_p KL(p, h): E_{x \distr p}[\ftr(x)] = \mean, p \geq 0, 1^{T}p = 1\) similarly, to show that the solution belongs to the exponential family generated by h and \(\ftr\).
Parametrization by means
t then corresponds to the Lagrange multipliers, whose values depend on \(\mean\). So, a distribution in the family can equivalently be parametrized by the means \(\mean\).
Polytope of means
The set of all achievable means forms a polytope; finding a distribution from an exponential family is then often viewed as finding a point \(\mean\) in this polytope: see statistics ref.
Any \(\mean\) in the polytope corresponds to some p.
Inverse Exponential decay for squared deviation from mean
Aka Normal distribution.
Importance
The ‘bell curve’ is often observed in nature; the Normal distribution happens to be a suitable light-tailed model for these phenomena.
It is also important due to the Central Limit Theorem: the distribution of the (suitably rescaled) sample mean approaches the normal distribution as the number of samples increases.
1D case
pdf, cdf
\(X \sim N[\mean,\stddev^{2}]\): a location parameter and a scale parameter. Range(X) = R. Probability density (Gaussian) \ \(N[x | \mean,\stddev^{2}] = f_{\mean, \stddev^{2}}(X=x) = \frac{1}{\stddev \sqrt{2\pi}}\exp(-\frac{(x-\mean)^{2}}{2\stddev^{2}})\).
Defined this way to ensure symmetry about \(\mean\), the mean: the bell curve; inverse exponential decay away from the mean.\ The \(\frac{1}{\stddev\sqrt{2\pi}}\) factor ensures that \(\int_{-\infty}^{\infty} N[x | \mean,\stddev^{2}]dx = 1\) (aka Normalization), using the Gaussian integral.
Thence confirm using direct integration: \(\int N[x | \mean,\stddev^{2}]\, x\, dx = \mean\). Also, using integration by parts, \(var[X] = E[X^{2}]-E[X]^{2} = \stddev^{2}\). The mode coincides with the mean.
Important probabilities
\(Pr(|X - \mean| \leq \stddev) \approx .68 \).
\(Pr(|X - \mean| \leq 2\stddev) \approx .95 \).
\(Pr(|X - \mean| \leq 3\stddev) \approx .997 \).
Standard Normal distribution
\(N(x|0,1) = \frac{1}{\sqrt{2 \pi}}\exp(-\frac{x^{2}}{2})\). Aka Z distribution. Very convenient, as every normal distribution can be viewed as a standard normal rescaled and shifted.
CDF calculation
The CDF \(F(x) = \int_{-\infty}^{x} f(t)dt\) looks like the sigmoid fn curve, but has no closed form. So, to calculate the CDF, convert to the standard normal distribution and look up the corresponding entry in a table. So, Inverse Exponential tail bounds hold. \chk
Often convert \(X \distr N(\mean, \stddev^{2}) \to (\frac{X - \mean }{\stddev}) \distr N(0, 1)\) to use N(0, 1) CDF table to find CDF of X.
In Matlab, can use erfc and erfcinv: \(F(x) = \frac{1}{2}\mathrm{erfc}(-u/\sqrt{2})\), where \(u = \frac{x - \mean}{\stddev}\).
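A minimal Python counterpart using the standard-library math.erfc (the helper name normal_cdf is illustrative); it also sanity-checks the \(\approx\) 68/95/99.7 rule above:

```python
import math

def normal_cdf(x, mean=0.0, sd=1.0):
    """F(x) for N(mean, sd^2) via the standardization u = (x - mean)/sd."""
    u = (x - mean) / sd
    return 0.5 * math.erfc(-u / math.sqrt(2))

print(normal_cdf(1) - normal_cdf(-1))   # ~0.6827
print(normal_cdf(2) - normal_cdf(-2))   # ~0.9545
print(normal_cdf(3) - normal_cdf(-3))   # ~0.9973
```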
Moment generating function
$$M(t) = E[e^{tX}] = \int e^{tx}\frac{1}{\stddev \sqrt{2\pi}}\exp(-\frac{(x-\mean)^{2}}{2\stddev^{2}}) dx \ = \int e^{t\mean + \stddev^{2}t^{2}/2}\frac{1}{\stddev \sqrt{2\pi}}\exp(-\frac{(x-\mean - \stddev^{2}t)^{2}}{2\stddev^{2}}) dx \ = e^{t\mean + \stddev^{2}t^{2}/2}$$ by completing the square.
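The completing-the-square step, explicitly:
$$tx - \frac{(x-\mean)^{2}}{2\stddev^{2}} = t\mean + \frac{\stddev^{2}t^{2}}{2} - \frac{(x - \mean - \stddev^{2}t)^{2}}{2\stddev^{2}},$$
so the integrand is a normal density centered at \(\mean + \stddev^{2}t\) times the constant \(e^{t\mean + \stddev^{2}t^{2}/2}\).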
Other properties
If \(\set{X_{i}}\) are iid \(N(\mean, \stddev^{2})\), then using the mgf, \(\sum a_{i}X_{i} \distr N(\sum a_{i}\mean, \sum a_{i}^{2}\stddev^{2})\).
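The mgf calculation behind this (independence gives the product):
$$M_{\sum a_i X_i}(s) = \prod_i M_{X_i}(a_i s) = \prod_i e^{a_i s\mean + \stddev^{2}a_i^{2}s^{2}/2} = e^{(\sum a_i \mean)s + (\sum a_i^{2}\stddev^{2})s^{2}/2},$$
which is the mgf of \(N(\sum a_{i}\mean, \sum a_{i}^{2}\stddev^{2})\).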
A log-concave distribution is close to Normal: just visualize.
If \(X \sim N[0, 1]\), \(f_{0,1}\) is an eigenfunction of the Fourier transform. \ Also, \(E[e^{sX^{2}}] = (1-2s)^{-1/2}\) for \(s < 1/2\) \why.
Multidimensional case
Definition with univariates
\(X \in R^{n}\) has multidimensional normal distribution if \(\forall a \in R^{n}: a^{T}X\) has a univariate normal distribution.
It is determined completely by \(\mean\) and the covariance matrix \(\covmatrix\), as for any random vector X, \(a^{T}X\) has mean \(a^{T}\mean\) and variance \(a^{T}\covmatrix a\).
So, any subvector Y also has a multivariate normal distribution. \exclaim{So, all marginals are normal!}
Distribution
\(x \in R^{D}\). Suppose \(\covmatrix \succ 0\).
\(N(x|\mean, \covmatrix) = \frac{1}{(2\pi)^{D/2} |\covmatrix|^{1/2}} e^{-\frac{1}{2}(x - \mean)^{T}\covmatrix^{-1}(x-\mean)}\). Parameters: \(\covmatrix, \mean\).
Reformulations
\(tr((x - \mean)^{T}\covmatrix^{-1}(x-\mean)) = tr(\covmatrix^{-1}(x-\mean)(x - \mean)^{T})\).
Also often written as \(\propto e^{-2^{-1}x^{T}Px + \mean^{T}Px} = e^{-2^{-1}x^{T}Ax + b^{T}x}\) with \(A = P\), \(b = P\mean\). The mode-finding problem, given this information form, is then to find \(\mean = A^{-1}b\): see Gaussian graphical model inference.
Singular covariance matrix
If the covariance matrix is singular, the density expression is ill-defined (\(\covmatrix^{-1}\) does not exist). But one can consider a lower-dimensional distribution embedded in a higher-dimensional space.
Covariance matrix is symmetric
If \(\covmatrix\) is a covariance matrix, it must be symmetric. Since only the symmetric part of \(\covmatrix^{-1}\) contributes to the quadratic form in the exponent, \(\covmatrix^{-1}\) can be taken to be symmetric; thence \(\covmatrix = \covmatrix^{T}\) is assumable.
If \(\covmatrix \succeq 0\): \(\covmatrix = U\EW U^{*}, |\covmatrix|^{1/2} = \prod \ew_{i}^{1/2}\).
Geometric view
Take a level set \(N(x|\mean, \covmatrix) = c\); \ get \(\gD = (x - \mean)^{T}\covmatrix^{-1}(x-\mean) = (x - \mean)^{T}U^{*}\EW^{-1}U(x-\mean) = c'\): a hyper-ellipse with the \(u_{i}\) as principal axes, radii proportional to \(\ew_{i}^{1/2}\), and \(\mean\) as center: see topology ref.
\(\gD\) is the Mahalanobis distance.
As a product of univariate normal distributions
So, take \(y = U(x-\mean)\) as the new axes. Then $$N(x|\mean, \covmatrix) = \ \frac{1}{(2\pi)^{D/2} \prod_j \ew_{j}^{1/2}} e^{-\frac{1}{2} \sum_j \frac{y_{j}^{2}}{\ew_{j}}}$$: thus factored into a product of univariate normal densities.
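This factorization also yields a sampling recipe; a minimal numpy sketch with an arbitrary test \(\covmatrix\):

```python
import numpy as np

rng = np.random.default_rng(0)
mean = np.array([1.0, -2.0])
cov = np.array([[2.0, 0.8],
                [0.8, 1.0]])   # symmetric, positive definite test matrix

# Spectral decomposition: cov = U diag(ew) U^T.
ew, U = np.linalg.eigh(cov)
A = U * np.sqrt(ew)            # A = U diag(sqrt(ew)), so A @ A.T = cov

# Draw independent univariate normals along the eigen-axes, rotate back:
z = rng.standard_normal((100_000, 2))
x = mean + z @ A.T             # each row: mean + A z_i

print(x.mean(axis=0))          # ~ mean
print(np.cov(x, rowvar=False)) # ~ cov
```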
A special case
If the \((X_{i})\) are \(\perp\), \(\covmatrix\) is diagonal, and this boils down to a product of univariate normal distributions, as expected.
Product of normal distributions
\(N(\mean_1, \covmatrix_1 = P_1^{-1}) \cdot N(\mean_2, \covmatrix_2 = P_2^{-1}) \propto N(\mean, \covmatrix)\) with \(\covmatrix^{-1} = P = P_1 + P_2\), \(\mean^{T} = (\mean_1^{T}P_1 + \mean_2^{T}P_2)P^{-1}\): consider the form of the exponent: \(\sum_i (-2^{-1} x^{T}P_ix + \mean_i^{T}P_ix)\).
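Adding the exponents makes the claim immediate:
$$\sum_{i=1,2}\left(-2^{-1}x^{T}P_ix + \mean_i^{T}P_ix\right) = -2^{-1}x^{T}(P_1 + P_2)x + (\mean_1^{T}P_1 + \mean_2^{T}P_2)x;$$
match this against \(-2^{-1}x^{T}Px + \mean^{T}Px\) to read off \(P\) and \(\mean\).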
\(\infty\) dimensional Normal distribution
Aka Gaussian process: an \(\infty\)-dimensional distribution whose marginal on every finite-dimensional subspace is a finite-dimensional Normal distribution. A Gaussian distribution over functions.
Gaussian graphical models
Uncorrelated variables
Take the graphical model graph G corresponding to the multidimensional normal distribution. Take the precision matrix \(V = \covmatrix^{-1}\). \(V_{i,j} = 0\) corresponds to the conditional independence \(X_i \perp X_j| X_{V - \set{i,j}}\) in G, and to the factorization \(f_X(x) = \prod_{i,j: V_{i,j} \neq 0} f_{i,j}(x_i, x_j)\).
Inference in this graphical model is often interesting: it is equivalent to solving Ax = b for symmetric A.
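A minimal numpy sketch of this equivalence, with an arbitrary chain-structured precision matrix as the test example:

```python
import numpy as np

# Precision matrix V of a 4-node chain: V[i,j] != 0 only for |i-j| <= 1.
V = np.array([[ 2.0, -1.0,  0.0,  0.0],
              [-1.0,  2.0, -1.0,  0.0],
              [ 0.0, -1.0,  2.0, -1.0],
              [ 0.0,  0.0, -1.0,  2.0]])
b = np.array([1.0, 0.0, 0.0, 1.0])

# Finding the mean of p(x) \propto exp(-x^T V x / 2 + b^T x)
# is exactly solving the symmetric system V mu = b.
mu = np.linalg.solve(V, b)
print(mu)   # the mean (= mode) of the Gaussian
```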
1D distributions from exponential family
Polynomial rise with inverse exponential decay for largeness
Aka Gamma distribution gamma(a, b).
Models time to complete some task.
pdf \(f_X(x) = \frac{b^{-a}x^{a-1}e^{-x/b}}{\gG(a)} \propto x^{a-1}e^{-x/b}\) for \(x \in [0, \infty)\): Pf: using the \(\gG()\) defn. Mean \(\mean = ab\); \(var[X] = ab^{2}\), using \(\gG()\) properties. By finding the critical point of f(x), the mode is \(b(a-1)\) (for \(a \geq 1\)).
This is a single mode distribution.
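The mode, from the critical point of \(\log f\):
$$\frac{d}{dx}\left((a-1)\log x - \frac{x}{b}\right) = \frac{a-1}{x} - \frac{1}{b} = 0 \implies x = b(a-1)$$
(valid for \(a \geq 1\); for \(a < 1\), the density is decreasing on its support).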
Inverse Gamma distribution
The distribution of U = 1/X, where \(X \distr gamma(a, b)\).
This is again a single mode distribution.
Exponential decay distribution \(expo(\mean)\)
pdf: \(f(x; \mean) = \mean^{-1}e^{-x/\mean}\) for \(x \in [0, \infty)\). Same as \(gamma(1, \mean)\).
CDF \(F(x) = 1 - e^{-x/\mean}\) for \(x\geq 0\).
\(\mean\) is the mean, by integration by parts.
\(var[X] = \mean^{2}\): use integration by parts to find \(E[X^{2}]\), then compute \(E[X^{2}] - \mean^{2}\).
Shift parameter
Note that analogous distributions \(f(x; t, \mean)\) can be defined with support \(x \in [t, \infty)\) by adding a location parameter \(t\).
Bilateral-exponential decay distribution
Aka double-exponential decay distribution. Aka Laplace distribution.
pdf \(f(x; t, m) = \frac{1}{2m} e^{-|x - t|/m}\).
The graph
The mode is \(t\). You get an exponential tail on both sides. The rate \(\frac{1}{m}\) controls how quickly the exponential decay kicks in. Aka exponential clock.