Importance
Fitting a model to observations, i.e., picking a probability distribution from a family of distributions, is an important component of many statistical tasks where one reasons about uncertainty by explicitly using probability theory. Such tasks are often labeled ‘Bayesian inference’.
Choosing the distribution family
Observe empirical distribution
Draw a bar graph, see what the curve looks like.
Given expected values of functions \( \set{E[\ftr_{i}(X)] = \mean_{i} } \) and a base measure \(h\)
Suppose we want to modify \(h\) as little as possible, under KL divergence, so that it satisfies \(E[\ftr(X)] = \mean\). If \(h\) is the uniform measure, this is the same as finding a maximum entropy distribution with \(E[\ftr(X)] = \mean\). Then, the solution belongs to the exponential family generated by \(h\) and \(\ftr()\): see probabilistic models ref.
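For reference, the solution has the standard exponential-family form \(p(x) = h(x)\exp(\sum_i \eta_i \ftr_i(x) - A(\eta))\), where the \(\eta_i\) are Lagrange multipliers chosen so that the moment constraints \(E[\ftr_i(X)] = \mean_i\) hold, and \(A(\eta) = \log \int h(x)\exp(\sum_i \eta_i \ftr_i(x))\,dx\) normalizes the density.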
Given dependence among features
Use graphical models - see probability ref.
Parametric density estimation
Described in a separate chapter.
Non parametric Probability Density estimation
Estimate the distribution on the input space using \(N\) samples \((x_{i})\), without restricting attention to a particular family of distributions.
Histogram and the Kernel histogram
Estimate the distribution from a bar graph of frequency vs. input interval, i.e., simply use a (normalized) histogram.
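A minimal sketch of histogram density estimation with NumPy (the sample, bin count, and query point are arbitrary illustrative choices):

```python
import numpy as np

# Hypothetical sample: 1000 iid draws from a standard normal.
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)

# density=True normalizes bar heights so the histogram integrates to 1,
# i.e. it becomes a piecewise-constant density estimate.
heights, edges = np.histogram(x, bins=30, density=True)

def hist_density(q):
    # Estimated density at q = height of the bin containing q.
    i = np.clip(np.searchsorted(edges, q, side="right") - 1, 0, len(heights) - 1)
    return heights[i]

print(hist_density(0.0))  # should be near the N(0,1) density at 0, ~0.40
```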
Kernel density estimation
(Parzen). \(p(x) = \frac{1}{Nh}\sum_{i=1}^{N} K(\frac{x-x_{i}}{h})\), for a kernel \(K\). A smoothing of the histogram using kernels.
Kernel function for density estimation
A non-negative, real-valued, integrable \(K(x)\) satisfying: \(\int_{-\infty}^{+\infty}K(u)du = 1\) (ensures the estimate is a valid PDF); \(K(-u) = K(u)\) (ensures the mean of each kernel bump is the corresponding data point). So, \(K(x)\) is akin to the kernel of an integral operator: see functional analysis ref.
Using Gaussian radial basis functions
Aka the Gaussian kernel. \(K(u) = \frac{1}{\sqrt{2\pi}}e^{-\frac{u^{2}}{2}}\): the Gaussian density with mean 0, variance 1.
This amounts to adding one Gaussian bump for each data point. \(h\), which controls the spread of each bump, is called the smoothing parameter or bandwidth.
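A minimal sketch of the Parzen estimator above with the Gaussian kernel (the sample, bandwidths, and query point are arbitrary illustrative choices):

```python
import numpy as np

def gaussian_kernel(u):
    # Standard normal density: integrates to 1, symmetric, mean 0, variance 1.
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde(x, samples, h):
    # Parzen estimator: p(x) = (1/(N h)) * sum_i K((x - x_i) / h).
    x = np.atleast_1d(x)
    u = (x[:, None] - samples[None, :]) / h   # shape: (num queries, N)
    return gaussian_kernel(u).sum(axis=1) / (len(samples) * h)

rng = np.random.default_rng(0)
samples = rng.standard_normal(500)

# Smaller h gives a spikier estimate; larger h gives a smoother one.
for h in (0.1, 0.3, 1.0):
    print(h, kde(0.0, samples, h))  # density estimate at 0 (~0.40 for N(0,1))
```

With the Gaussian kernel, the estimate is literally a mixture of \(N\) Gaussians, one centered at each data point with standard deviation \(h\).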
+++(Can approximate any distribution by mixture of Gaussians!)+++
Estimate probability measures
Use empirical measures
Empirical measure
The estimated measure of a given event/ function \(f\) using \(n\) samples is \(v_{n}(f|D) = n^{-1} \sum_i I_{f}(X_i)\), where \(D = \set{X_i}\) is a set of samples drawn iid from \(v\).
Goodness of estimate: single event
By the law of large numbers, \(\lim_{n \to \infty} \frac{\sum_i f(X_{i})}{n} = E_X[f(X)] = v(f)\) almost surely. Also, we can use the Hoeffding inequality (see probabilistic models survey) to bound \(Pr_D(|v(f) - v_n(f|D)| \geq \eps)\) and see that it decreases exponentially with increasing \(n\) and \(\eps\).
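A minimal sketch, taking the event \(X > 1\) under a standard normal \(v\) (an arbitrary illustrative choice):

```python
import numpy as np
from math import erf, exp, sqrt

rng = np.random.default_rng(0)

def empirical_measure(f, samples):
    # v_n(f | D) = (1/n) * sum_i I_f(X_i)
    return float(np.mean([f(x) for x in samples]))

f = lambda x: x > 1.0                        # the event {X > 1}
true_vf = 1 - 0.5 * (1 + erf(1 / sqrt(2)))   # v(f) = P(X > 1) under N(0,1), ~0.1587

eps = 0.05
for n in (100, 1000, 10000):
    vn = empirical_measure(f, rng.standard_normal(n))
    hoeffding = 2 * exp(-2 * n * eps ** 2)   # bound on Pr(|v(f) - v_n(f|D)| >= eps)
    print(n, vn, abs(vn - true_vf), hoeffding)
```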
Bound variability in estimate
From the central limit theorem, for large \(n\), \(v_n(f|D)\) is approximately distributed as \(N(v(f), \frac{\stddev}{\sqrt{n}})\) (the second argument being the standard deviation), where \(\stddev\) is the standard deviation of \(f(X)\).
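For example, this gives the familiar asymptotic confidence interval: with probability roughly 0.95, \(|v_n(f|D) - v(f)| \leq 1.96\,\stddev/\sqrt{n}\).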
Also, we can use: \(Pr_{D, D'}(|v_n(f|D) - v_n(f|D')|\geq \eps) \leq Pr_{D, D'}(|v_n(f|D) - v(f)|\geq \eps/2 \lor |v_n(f|D') - v(f)|\geq \eps/2) \leq 2Pr_D(|v_n(f|D) - v(f)| \geq \eps/2)\) (union bound) and use the Hoeffding inequality again.
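Concretely, assuming \(f\) takes values in \([0,1]\), the Hoeffding inequality gives \(Pr_D(|v_n(f|D) - v(f)| \geq \eps/2) \leq 2e^{-n\eps^{2}/2}\), so \(Pr_{D, D'}(|v_n(f|D) - v_n(f|D')| \geq \eps) \leq 4e^{-n\eps^{2}/2}\).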
Goodness of estimate for a class of events
Let \(F = \set{f}\) be a class of events (or binary functions) defined on the input space \(X\), on which \(v\) is a measure. Let \(E_{F}(m)\) be the maximum number of dichotomies \(F\) induces on \(m\) points: see boolean functions survey.
(Vapnik, Chervonenkis).
Then \(Pr(\sup_{f \in F} |v_n(f|D) - v(f)| > \eps) \leq 8 E_{F}(n)\exp(\frac{-n\eps^{2}}{32})\).
Proof
If we were to use the union bound and the Hoeffding inequality naively, we would have a factor of \(2|F| \gg 8E_F(n)\) on the RHS. So, we want to get to a point where we need only take the union bound over a small subset of \(F\).
So, first we show the following lemma: \(Pr_D(\sup_{f \in F} |v_n(f|D) - v(f)| > \eps) \leq 2Pr_{D, D'}(\sup_{f \in F} |v_n(f|D) - v_n(f|D')| > \eps/2)\), for a sample set \(D'\) acquired in the same way as \(D\).
Proof
\(Pr_{D, D'}(\sup_{f \in F} |v_n(f|D) - v_n(f|D')| > \eps/2) \geq Pr_{D}(\sup_{f \in F} |v_n(f|D) - v(f)| > \eps)\, Pr_{D'}(|v_n(f'|D') - v(f')| < \eps/2 \;|\; |v_n(f'|D) - v(f')| > \eps)\), where \(f'\) is a function attaining the sup for the given \(D\); the latter factor can be seen to be at least \(1/2\) (for large enough \(n\)) using the Chebyshev inequality, which yields the lemma. This is called the ‘ghost sample technique’.
So, now we need only bound \(Pr_{D, D'}(\sup_{f \in F} |v_n(f|D) - v_n(f|D')| > \eps/2)\). One way to deal with this is to take the union bound now over the set of all dichotomies induced over \(D \union D'\) to get the bound \(E_F(2n)\, Pr_{D, D'}(|v_n(f|D) - v_n(f|D')| > \eps/2)\), which can then be bounded as described elsewhere.
Sharper bound
The sharper bound we require can be obtained using a different analysis, involving \(E_F(n)\) instead. The process of picking \(D, D'\) and calculating \(n^{-1}|\sum_i (f(X_i) - f(X'_i))|\) is equivalent to the process of picking \(D, D'\), then picking \(n\) random signs \(s_i \in \set{\pm 1}\), and then calculating \(n^{-1}|\sum_i s_i(f(X_i) - f(X'_i))|\): in other words, we do this \(n\) times: pick a pair of points and assign them to \(D\) and \(D'\) at random. So, we bound \(Pr_{D, D', s}(\sup_{f \in F} n^{-1}|\sum_i s_i(f(X_i) - f(X'_i))| > \eps/2) \leq Pr_{D, D', s}(\sup_{f \in F} [|n^{-1}\sum_i s_i f(X_i)| \geq \eps/4 \lor |n^{-1}\sum_i s_i f(X'_i)| \geq \eps/4]) \leq 2Pr_{D, s}(\sup_{f \in F} |n^{-1}\sum_i s_i f(X_i)| \geq \eps/4)\). A union bound over the \(E_F(n)\) dichotomies induced on \(D\) and the Hoeffding inequality can now be applied to bound this.
Estimate CDF using empirical CDF
(Glivenko-Cantelli). Pick \(\set{Z_i} \distr F\), the CDF. Then \(Pr(\sup_z |F_n(z) - F(z)| > \eps) \leq 8(n+1)\exp(\frac{-n\eps^{2}}{32})\). Pf: apply the Vapnik-Chervonenkis theorem above with half-lines \((-\infty, z]\) as classifiers, for which \(E_F(n) = n+1\).
This answers: what \(n\) do you need to achieve low error? Thence, by Borel-Cantelli, \(\lim_{n \to \infty} \sup_z |F_n(z) - F(z)| = 0\) with probability 1.
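A minimal sketch of empirical-CDF convergence, assuming the true \(F\) is the standard normal CDF (an arbitrary illustrative choice):

```python
import numpy as np
from math import erf, sqrt, exp

rng = np.random.default_rng(0)

def normal_cdf(z):
    # CDF of N(0,1) at a scalar z.
    return 0.5 * (1 + erf(z / sqrt(2)))

grid = np.linspace(-4, 4, 2001)
true_F = np.array([normal_cdf(z) for z in grid])

def sup_deviation(n):
    samples = np.sort(rng.standard_normal(n))
    # Empirical CDF: F_n(t) = (# samples <= t) / n, evaluated on the grid.
    F_n = np.searchsorted(samples, grid, side="right") / n
    return np.max(np.abs(F_n - true_F))

eps = 0.05
for n in (100, 1000, 10000):
    # The VC-type bound above; it is loose (vacuous unless n is very large),
    # but the observed sup deviation does shrink with n.
    bound = 8 * (n + 1) * exp(-n * eps ** 2 / 32)
    print(n, sup_deviation(n), min(bound, 1.0))
```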