Parametric density estimation

Problem and solution ideals

Density estimation using a distribution class

Suppose that parameter $t$ specifies the distribution $f_t(x_r|x_{\lnot r})$, where $x_{\lnot r}$ can be empty! Let $T$ be parameter space spanned by such $t$; it represents the class of distributions which can be specified in this form.

Given finite data set $\set{x^{(i)}}$, we want to approximate an unknown target distribution $D(x_r|x_{\lnot r})$ which may not belong to this distribution family using $T$. The approximation can be a weighted combination of combination of distributions in $T$.

Note that estimating $t$ which is good at predicting the value of $x_r$ given $x_{\lnot r}$ by doing $h(x_{\lnot r}) = \max_{x_r} f_t(x_r|x_{\lnot r})$ is a separate problem, where a different estimation procedure which minimizes the classification error $Pr(h(x_{\lnot r}) \neq x_r)$ (corresponding to the 0/1 classification loss) may be used.

Solution ideas

Empirical risk minimization, for various forumulations of risk functions which in someway also incorporate prior belief about the best $t$.

Approximation with Normal distribution

Aka Laplace approximation. Suppose that we have the probability distribution $p(x) = Z^{-1}f(x)$.

Algorithm

Here, one finds a mode / strict local maximum $x’$ of the distribution $p(x)$, which corresponds to a mode of $f(x)$, using numerical techniques. Then, one creates a normal distribution $N(\mean = x’, \stddev)$ around this point.

2nd order approximation of log f

$\log N(x’, \covmatrix)$ is a quadratic function. So, we try to find $\covmatrix$ such that $\log N$ approximates the 2nd order approximation of $\log f$.

As $\gradient f(x’)= 0$, taking the quadratic approximation $\log f(x) \approx \log f(x’) + (f(x’))^{-2} (x-x’)^{T}\gradient^{2} f(x’)(x - x’)$. So, $f(x) \approx f(x’)exp((f(x’))^{-2} (x-x’)^{T}\gradient^{2} f(x’)(x - x’))$

The RHS can now be used to construct $\covmatrix$. As $x’$ is a mode, $-2(f(x’))^{-2}\gradient^{2} f(x’) \succ 0$ as expected.

Properties

Estimating Z

Z can be approximated to equal the normailizer of N, which is easily calculated. Note that knowledge of Z is not required to get the approximation $N$.

Non-uniqueness

Different modes yield different approximate distributions.

Log loss minimization

Aka Maximum likelihood estimation (MLE).

Optimization problem, estimate

Likelihood function: $L:T \to [0,1]$. \ $L(t|\set{x^{(i)}}) = f_t(\set{x_r^{(i)}}|\set{x_{\lnot r}^{(i)}}) = \prod f_t(x^{(i)}|x_{\lnot r}^{(i)})$.

$\hat{t} = argmax_{t} L(t|x)$. This may be a biased estimator; but is always a sufficient statistic, as it is defined on $L(t|x)$. Often, an equivalent optimization problem: minimizing log-likelihood is used.

Functional Invariance property

If you want to estimate $g(t)$, $g(\hat{t})$ is the MLE of g(t). From definition.

Avg Log likelihood function

Take $l(t|\set{x^{(i)}}) = \ln(L(t|\set{x^{(i)}}))$, and do $\min_t -n^{-1}l(t|X)$. Useful as often $L$ and distribution of X are from exponential family.

Example: In case of $N(\mean, \stddev^{2})$, $\ln f(\set{x^{(i)}}|\mean, \stddev^{2}) = -\frac{1}{2\stddev^{2}}\sum(x^{(i)} - \mean)^{2} - \frac{N}{2}\ln (\frac{\stddev^{2}}{2 \pi})$; by maximization, MLE is $\mean = \bar{x}, \stddev^{2}= N^{-1}(x^{(i)} - \mean)^{2}$: because biased estimator is used, $N^{-1}(x^{(i)} - \mean)^{2}$ often underestimates.

Other perspectives

As log loss risk minimization

The negative log-likelihood of a single sample-point, $-\log L(t|x)$, is also called the ’log loss’ in the general decision theoretic framework. So, by doing maximum likelihood estimation, we are actually minimizing empirical log-loss.

Priors as regularizers

If you add regularizer $r(t)$, you are imposing a prior distribution on $t$; so you are doing bayesian inference. The optimization problem becomes: $\min l(t|X) + r(t)$.

As empirical code-length divergence minimization

Let \cF be a class of distributions, \ let $D$ be the actual distribution of $X$. In the limit where $n \to \infty$, maximum likelihood estimation tries to find $\argmin_{F \in \cF} E_D[- \log F(X)]$. This is the same problem as finding $\argmin_{F \in \cF} E_D[- \log F(X)] - E_D[-\log D(x)] = \argmin_{F \in \cF} KL(F||D)$. So we are finding a member of \cF, with minimum code-length divergence to $D$.

Derivatives of log likelihood

Score function : Sensitivity of log Likelihood

$V(t, X) = \gradient_t{\log L(t|X)} = L(t|X)^{-1} \gradient_t L(t|X)$: variability of $L(t|X)$ normalized by $L(t|X)$: like conditioning in numerical analysis.

Mean wrt X

Under some regularity conditions, \

$$E_X[V(t, X)] = \int_x (f_{X|T}(x|t)^{-1} \gradient_t f_{X|T}(x|t)) f_{X|T}(x|t))dx =\ \gradient_t \int_x f_{X|T}(x|t)) dx = \gradient_t 1 = 0$$.

Variance wrt X of sensitivity score of likelihood

Aka Fisher Information matrix. As $E_X[V(t, X)] = 0$, $I(t) = E_X[V(t, X)^{2}] = E_X[(\gradient_t{\log L(t|X)})^{2} | t]$. Measures information about t in the observable RV X. If I(t) is high for RV X, then the absolute value of the sensitivity score is high in expectation.

If conditions like those in $E_X[V(t, X)] = 0$ hold,\ $E_X[(f_{X|T}(x|t))^{-1} \gradient_t^{2} f_{X|T}(x|t)] = 0$ will hold, and so $I(t) = -E_X[(\gradient_t^{2}{\log L(t|X)})^{2} | t]$.

Computational cost

If this optimization problem can be accurately and efficiently: great!

Computing partition function

Suppose only $f(x, t)$ proportional to th epdf $f_t(x_r|x_{\lnot r})$ is specified. Doing \ $\max_r \frac{f(x, t)}{\sum_y f(y, t)}$ is not the same as doing $\max_r f(x, t)$, as the normalizer (aka partition function) $Z(t) = \sum_y f(y, t)$, even though independent of $x$, varies with $t$.

If range of $x$ is huge, and Z(t) does not have a closed-form solution, computing $Z(t) = \sum_y f(y, t)$ can become costly and MLE is impractical.

Ease in case of conditional probabilities

Suppose that $range(X_t)$ is actually small. Here, rather than having to sum over the entire range of $X$, which may be $range(x_r)^{|V|}$ in size, we just sum over $range(x_r)$ values to get $Z(t) = \sum_y f(y, t)$.

Pseudolikelihood maximization

In case finding/ maximizing $f_t(x_r|x_{\lnot r})$ is hard due to the need to compute $Z(t)$, but finding $f_{X_{r_j}|X_{\lnot r_j}}$ is easy. So we can consider maximizing the pseudo-likelihood function $\prod_{r_j} f_{X_{r_j}|X_{\lnot r_j}}$ instead.

Non uniform model for P(t)

Aka Maximum a posteriori (MAP).

Objective, estimate

Posterior probability of model considering observations $\propto$ likelihood of observations given the model $\times$ prior probability of model.

So, solve : $argmax_{t} f_{X|T}(x|t)f_T(t)$, or $argmax_{t} f_X(x)$ + regularizer.

Relation to MLE

MLE is a special case: it ignores prior distribution/ assumes it are uniform. So, often superior MLE. Eg: In MLE, upon seeing 4 H in coin tosses, you would conclude that the coin will always come up H.

The MLE optimization objective with a regularizer is a case of MAP, where the regularizer implitly defines a prior.

Defining prior distributions

Often, the prior distribution on $T$ may be specified, and the regularizer associated with the risk function may be derived thence.

Hyperparameters for prior distribution of parameters

$f_{T|C}$ may have hyperparameters C, which may inturn come from distribution $f_{C|D}$ with hyperparameters $D$. Eventually must fix (hyper)parameter, perhaps with n-fold cross validation.

Eg: Can have hyper-parameters: $\stddev$ for label noise , $\ga$ for prior over $\param$.

These are akin to parameters, but of a special kind.

Conjugate prior for a likelihood

You got distribution family $f_{X|T=t}$. Conjugacy: If $f_{X|T=t}$ and $f_T(t)$\ have the same form, finding $Pr(t|X) = f_{(X_{i})|t}((x_i))f_T(t) = \prod_{i} f_{X|T=t}(x_i)f_T(t)$ simpler. Then can update these probabilities with each incoming observation $X_{i}$ easily.

Eg: For k-categorical distribution: $Pr(X_{j}=x|p) = \prod_{i=1}^{k} p_{i}^{x_{i}}$, dirichlet distribution is the conjugate: $f_{P|A=a}(p) = \prod p_{i}^{a_{i}-1}$ for hyperparameters a.

Model combination

Aka Fully Bayesian approach. We may first determine posterior distribution over parameters $f_{T|S}$ in the training stage, where $S$ is the training set. We may then arrive at the model $f_{L|X, S} = \int_{t \in T} f_{L|X, T=t}f_{T|S}(t)$.

So you use an ensemble of hypothesis distribution models.

Information criteria

Bayesian/ Schwartz information criterion (BIC): $\min 2^{-1}L(D, t) + p \log n$, where $p$ is the number of parameters needed to specify $t$, and $n$ is the number of samples.

Use

These are good if certain ‘Gaussianness’ assumptions hold, otherwise, the result of using them can be misleading. They allow us to pick $t$ without having to, for example, do cross-validation, which would have been necessary if $\min L(D, t) + l r(t)$ were used instead.