Density estimation

Importance

Fitting a model to observations, i.e., picking a probability distribution from a family of distributions, is an important component of many statistical tasks in which one reasons about uncertainty explicitly using probability theory. Such tasks are often labeled ‘Bayesian inference’.

Choosing the distribution family

Observe empirical distribution

Draw a bar graph, see what the curve looks like.

Given expected values of functions \(\{E[\phi_i(X)] = \mu_i\}\) and a base measure h

Suppose we want to modify h as little as possible, under KL divergence, so that the resulting distribution satisfies \(E[\phi(X)] = \mu\). If h is the uniform distribution, this is the same as finding the maximum entropy distribution with \(E[\phi(X)] = \mu\). The solution then belongs to the exponential family generated by h and \(\phi(\cdot)\): see probabilistic models ref.
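For reference, a sketch of the standard solution form (a well-known fact about minimum-KL / maximum-entropy problems, stated here without derivation): the KL-minimal modification of h is an exponential tilting of h, with multipliers \(\lambda_i\) chosen to meet the moment constraints.

\[
p^*(x) = \frac{h(x)\,\exp\!\big(\sum_i \lambda_i \phi_i(x)\big)}{\int h(x')\,\exp\!\big(\sum_i \lambda_i \phi_i(x')\big)\,dx'}, \qquad E_{p^*}[\phi_i(X)] = \mu_i.
\]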

Given dependence among features

Use graphical models - see probability ref.

Parametric density estimation

Described in a separate chapter.

Non-parametric probability density estimation

Estimate a distribution on the input space using N samples \((x_i)\), without restricting attention to a particular parametric family of distributions.

Histogram and the kernel histogram

Estimate the distribution from a bar graph of frequency vs. input interval; that is, simply use a histogram.
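A minimal sketch of a histogram density estimate in numpy; the sample data and bin count are illustrative assumptions, not prescriptions.

```python
import numpy as np

# Illustrative data: draws from a standard normal (an assumption for the demo).
rng = np.random.default_rng(0)
samples = rng.normal(size=1000)

# density=True normalizes bar areas so the histogram integrates to 1,
# i.e., it is a piecewise-constant density estimate.
densities, bin_edges = np.histogram(samples, bins=30, density=True)

def hist_density(x):
    """Evaluate the histogram density estimate at a point x."""
    idx = np.searchsorted(bin_edges, x, side="right") - 1
    if 0 <= idx < len(densities):
        return densities[idx]
    return 0.0  # outside the observed range

print(hist_density(0.0))  # roughly 0.4 for standard normal data
```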

Kernel density estimation

(Parzen). \(\hat{p}(x) = \frac{1}{Nh}\sum_{i=1}^{N} K\!\left(\frac{x - x_i}{h}\right)\), for a kernel K. A smoothing of the histogram using kernels.
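A minimal sketch of the Parzen estimate in numpy, assuming the Gaussian kernel described below and a hand-picked bandwidth h (both choices are illustrative):

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density: mean 0, variance 1."""
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

def kde(x, data, h, kernel=gaussian_kernel):
    """Parzen estimate: p_hat(x) = (1/(N h)) * sum_i K((x - x_i) / h)."""
    data = np.asarray(data)
    return kernel((x - data) / h).sum() / (len(data) * h)

# Illustrative usage with toy data and an arbitrary bandwidth.
rng = np.random.default_rng(0)
data = rng.normal(size=500)
print(kde(0.0, data, h=0.3))  # roughly 0.4 for standard normal data
```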

Kernel function for density estimation

A non-negative, real-valued, integrable K(x) satisfying: \(\int_{-\infty}^{+\infty} K(u)\,du = 1\) (ensures the estimate is a valid PDF); \(K(-u) = K(u)\), i.e., symmetry (ensures the mean of the bump placed at a data point is that data point). So K(x) is akin to the kernel of an integral operator: see functional analysis ref.

Using Gaussian radial basis functions

Aka the Gaussian kernel. \(K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}\): the Gaussian density with mean 0, variance 1.

This amounts to adding one Gaussian bump for each data point. h, which controls the width (variance) of each bump, is called the smoothing parameter or bandwidth.

(Can approximate any distribution by mixture of Gaussians!)
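A small numerical check in the same vein (all data and bandwidth values are illustrative): it verifies the two kernel properties for the Gaussian kernel and shows how the bandwidth h changes the estimate at a single point.

```python
import numpy as np
from scipy.integrate import quad

def gaussian_kernel(u):
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

def kde(x, data, h):
    return gaussian_kernel((x - np.asarray(data)) / h).sum() / (len(data) * h)

# Property 1: the kernel integrates to 1.
print(round(quad(gaussian_kernel, -np.inf, np.inf)[0], 6))      # 1.0

# Property 2: symmetry, K(-u) == K(u).
print(np.isclose(gaussian_kernel(-1.7), gaussian_kernel(1.7)))  # True

# Bandwidth effect: small h gives a spiky estimate, large h oversmooths.
rng = np.random.default_rng(1)
data = rng.normal(size=200)
for h in (0.05, 0.3, 2.0):
    print(h, kde(0.0, data, h))
```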

Estimate probability measures

Use empirical measures

Empirical measure

The estimated measure of a given event f using n samples is \(\nu_n(f \mid D) = n^{-1}\sum_i I_f(X_i)\) (equivalently \(n^{-1}\sum_i f(X_i)\) when f is a binary function), where \(D = \{X_i\}\) is a set of samples drawn iid from \(\nu\).
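A minimal sketch of computing an empirical measure for one event; the distribution and the event f are assumptions chosen for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=10_000)      # samples X_i drawn iid from nu = N(0, 1)

f = lambda x: x > 1.0            # event f: "X exceeds 1", as a binary function

nu_n = np.mean(f(D))             # nu_n(f | D) = n^{-1} * sum_i f(X_i)
print(nu_n)                      # close to the true measure nu(f) ~= 0.1587
```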

Goodness of estimate: single event

By the law of large numbers, \(\nu_n(f\mid D) = n^{-1}\sum_i f(X_i) \to E_X[f(X)] = \nu(f)\) as \(n \to \infty\). Also, we can use the Hoeffding inequality (see probabilistic models survey) to bound \(\Pr_D(|\nu(f) - \nu_n(f\mid D)| \geq \epsilon)\) and see that it decreases exponentially with increasing n and \(\epsilon\).
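For a \([0,1]\)-valued f, the standard Hoeffding bound (a well-known inequality, stated here without derivation) makes this explicit:

\[
\Pr_D\big(|\nu(f) - \nu_n(f \mid D)| \geq \epsilon\big) \leq 2\exp(-2n\epsilon^2).
\]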

Bound variability in estimate

From the central limit theorem, we know that, as \(n \to \infty\), the distribution of \(\nu_n(f\mid D)\) approaches a normal distribution with mean \(\nu(f)\) and standard deviation \(\sigma/\sqrt{n}\), where \(\sigma^2\) is the variance of \(f(X)\).

Also, we can use: \(\Pr_{D,D'}\big(|\nu_n(f\mid D) - \nu_n(f\mid D')| \geq \epsilon\big) \leq \Pr_{D,D'}\big(|\nu_n(f\mid D) - \nu(f)| \geq \epsilon/2 \lor |\nu_n(f\mid D') - \nu(f)| \geq \epsilon/2\big) \leq 2\Pr\big(|\nu_n(f\mid D) - \nu(f)| \geq \epsilon/2\big)\), and use the Hoeffding inequality again.

Goodness of estimate for a class of events

Let \(F = \{f\}\) be a class of events (or binary functions) defined on the input space X, on which \(\nu\) is a measure. Let \(E_F(m)\) be the maximum dichotomy count (growth function): see boolean functions survey.

(Vapnik, Chervonenkis).

Then \(\Pr\left(\sup_{f \in F} |\nu_n(f\mid D) - \nu(f)| > \epsilon\right) \leq 8\,E_F(n)\exp\left(-\frac{n\epsilon^2}{32}\right)\).

Proof

If we were to use the union bound and the Hoeffding inequality naively, we would have a factor of \(2|F| \gg 8\,E_F(n)\) on the RHS. So, we want to get to a point where we need only take the union bound over a small subset of F.

So, first we show the following lemma: \(\Pr_D\big(\sup_{f \in F} |\nu_n(f\mid D) - \nu(f)| > \epsilon\big) \leq 2\Pr_{D,D'}\big(\sup_{f \in F} |\nu_n(f\mid D) - \nu_n(f\mid D')| > \epsilon/2\big)\), for a ghost sample set \(D'\) acquired in the same way as D.

Proof

\(\Pr_{D,D'}\big(\sup_{f \in F} |\nu_n(f\mid D) - \nu_n(f\mid D')| > \epsilon/2\big) \geq \Pr_D\big(\sup_{f \in F} |\nu_n(f\mid D) - \nu(f)| > \epsilon\big) \cdot \Pr_{D'}\big(|\nu_n(f\mid D') - \nu(f)| < \epsilon/2 \mid |\nu_n(f\mid D) - \nu(f)| > \epsilon\big)\); and the latter factor can be seen to be at least 1/2, since by the Chebyshev inequality \(\Pr_{D'}(|\nu_n(f\mid D') - \nu(f)| \geq \epsilon/2) < 1/2\) for n large enough. This is called the ‘ghost sample technique’.

So, now we need only bound \(\Pr_{D,D'}\big(\sup_{f \in F} |\nu_n(f\mid D) - \nu_n(f\mid D')| > \epsilon/2\big)\). One way to deal with this is to take the union bound now over the set of all dichotomies induced over \(D \cup D'\), to get the bound \(E_F(2n)\,\Pr_{D,D'}\big(|\nu_n(f\mid D) - \nu_n(f\mid D')| > \epsilon/2\big)\), which can then be bounded as described elsewhere.

Sharper bound

The sharper bound we require can be obtained using a different analysis, involving \(E_F(n)\) instead. The process of picking \(D, D'\) and calculating \(n^{-1}|\sum_i f(X_i) - \sum_i f(X'_i)|\) is equivalent to the process of picking \(D, D'\), then picking n random sign bits \(s_i \in \{\pm 1\}\), and then finding \(n^{-1}|\sum_i s_i(f(X_i) - f(X'_i))|\): in other words, we do this n times: pick a pair of points and assign them to D and \(D'\) at random. So, we bound \(\Pr_{D,D',s}\big(\sup_{f \in F} n^{-1}|\sum_i s_i(f(X_i) - f(X'_i))| > \epsilon/2\big) \leq \Pr_{D,D',s}\big(\sup_{f \in F} [|n^{-1}\sum_i s_i f(X_i)| \geq \epsilon/4 \lor |n^{-1}\sum_i (-s_i) f(X'_i)| \geq \epsilon/4]\big) \leq 2\Pr_{D,s}\big(\sup_{f \in F} |n^{-1}\sum_i s_i f(X_i)| \geq \epsilon/4\big)\). A union bound over the \(E_F(n)\) dichotomies that members of F induce on D, and a Hoeffding inequality (conditional on D), can now be applied to bound this.

Estimate CDF using empirical CDF

(Glivenko-Cantelli). Pick \(\{Z_i\} \sim F\), the CDF. Then \(\Pr\left(\sup_z |F_n(z) - F(z)| > \epsilon\right) \leq 8(n+1)\exp\left(-\frac{n\epsilon^2}{32}\right)\). Pf: apply the VC theorem with half-lines \((-\infty, z)\) as classifiers; this class induces at most \(n+1\) dichotomies on n points, hence the \(E_F(n) = n+1\) factor.

This answers: what n do you need to achieve low error? Thence, by Borel-Cantelli: \(\lim_{n\to\infty} \sup_z |F_n(z) - F(z)| = 0\) with probability 1.
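A minimal sketch of this convergence, assuming standard normal data (an illustrative choice): it computes \(\sup_z |F_n(z) - F(z)|\) for increasing n and shows it shrinking.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def sup_deviation(n):
    """sup_z |F_n(z) - F(z)| for n iid N(0,1) samples (Kolmogorov-style statistic)."""
    z = np.sort(rng.normal(size=n))
    F = norm.cdf(z)                      # true CDF at the sample points
    ecdf_hi = np.arange(1, n + 1) / n    # F_n just at/after each sample point
    ecdf_lo = np.arange(0, n) / n        # F_n just before each sample point
    return max(np.abs(ecdf_hi - F).max(), np.abs(ecdf_lo - F).max())

for n in (100, 1_000, 10_000, 100_000):
    print(n, round(sup_deviation(n), 4))  # deviation shrinks roughly like 1/sqrt(n)
```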