Importance
Fitting a model to observations, i.e., picking a probability distribution from a family of distributions, is an important component of many statistical tasks where one reasons about uncertainty by explicitly using probability theory. Such tasks are often labeled ‘Bayesian inference’.
Choosing the distribution family
Observe empirical distribution
Draw a bar graph, see what the curve looks like.
Given expected values of functions \( \set{E[\ftr_{i}(X)] = \mean_{i} } \) and a base measure \(h\)
Suppose we want to modify \(h\) as little as possible, under KL divergence, so that it satisfies \(E[\ftr(X)] = \mean\). If \(h\) is the uniform measure, this is the same as finding a maximum entropy distribution with \(E[\ftr(X)] = \mean\). Then, the solution belongs to the exponential family generated by \(h\) and \(\ftr()\): see probabilistic models ref.
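For reference, the solution has the standard exponential-family form \(p(x) = h(x)\exp(\sum_i \eta_i \ftr_i(x) - A(\eta))\), where the \(\eta_i\) are Lagrange multipliers chosen so that the moment constraints \(E[\ftr_i(X)] = \mean_i\) hold, and \(A(\eta) = \log \int h(x)\exp(\sum_i \eta_i \ftr_i(x))\,dx\) normalizes the density.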
Given dependence among features
Use graphical models - see probability ref.
Parametric density estimation
Described in a separate chapter.
Non parametric Probability Density estimation
Estimate the distribution on the input space using \(N\) samples \((x_{i})\), without restricting attention to a particular family of distributions.
Histogram and the Kernel histogram
Estimate the distribution from a bar graph of frequency vs. input interval, i.e., simply use a (normalized) histogram.
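A minimal sketch of histogram density estimation with NumPy (the sample, bin count, and query point are arbitrary illustrative choices):

```python
import numpy as np

# Hypothetical sample: 1000 iid draws from a standard normal.
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)

# density=True normalizes bar heights so the histogram integrates to 1,
# i.e. it becomes a piecewise-constant density estimate.
heights, edges = np.histogram(x, bins=30, density=True)

def hist_density(q):
    # Estimated density at q = height of the bin containing q.
    i = np.clip(np.searchsorted(edges, q, side="right") - 1, 0, len(heights) - 1)
    return heights[i]

print(hist_density(0.0))  # should be near the N(0,1) density at 0, ~0.40
```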
Kernel density estimation
(Parzen). \(p(x) = \frac{1}{Nh}\sum_{i=1}^{N} K(\frac{x-x_{i}}{h})\), for a kernel \(K\). A smoothing of the histogram using kernels.
Kernel function for density estimation
A non-negative, real-valued, integrable \(K(x)\) satisfying: \(\int_{-\infty}^{+\infty}K(u)du = 1\) (ensures the estimate is a valid PDF); \(K(-u) = K(u)\) (ensures the mean of each kernel bump is the corresponding data point). So, \(K(x)\) is akin to the kernel of an integral operator: see functional analysis ref.
Using Gaussian radial basis functions
Aka the Gaussian kernel. \(K(u) = \frac{1}{\sqrt{2\pi}}e^{-\frac{u^{2}}{2}}\): the Gaussian density with mean 0, variance 1.
This amounts to adding one Gaussian bump for each data point. \(h\), which controls the spread of each bump, is called the smoothing parameter or bandwidth.
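A minimal sketch of the Parzen estimator above with the Gaussian kernel (the sample, bandwidths, and query point are arbitrary illustrative choices):

```python
import numpy as np

def gaussian_kernel(u):
    # Standard normal density: integrates to 1, symmetric, mean 0, variance 1.
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde(x, samples, h):
    # Parzen estimator: p(x) = (1/(N h)) * sum_i K((x - x_i) / h).
    x = np.atleast_1d(x)
    u = (x[:, None] - samples[None, :]) / h   # shape: (num queries, N)
    return gaussian_kernel(u).sum(axis=1) / (len(samples) * h)

rng = np.random.default_rng(0)
samples = rng.standard_normal(500)

# Smaller h gives a spikier estimate; larger h gives a smoother one.
for h in (0.1, 0.3, 1.0):
    print(h, kde(0.0, samples, h))  # density estimate at 0 (~0.40 for N(0,1))
```

With the Gaussian kernel, the estimate is literally a mixture of \(N\) Gaussians, one centered at each data point with standard deviation \(h\).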
+++(Can approximate any distribution by mixture of Gaussians!)+++
Estimate probability measures
Use empirical measures
Empirical measure
The estimated measure of a given event/ function \(f\) using \(n\) samples is \(v_{n}(f|D) = n^{-1} \sum_i I_{f}(X_i)\), where \(D = \set{X_i}\) is a set of samples drawn iid from \(v\).
Goodness of estimate: single event
By the law of large numbers, \(\lim_{n \to \infty} \frac{\sum_i f(X_{i})}{n} = E_X[f(X)] = v(f)\) almost surely. Also, we can use the Hoeffding inequality (see probabilistic models survey) to bound \(Pr_D(|v(f) - v_n(f|D)| \geq \eps)\) and see that it decreases exponentially with increasing \(n\) and \(\eps\).
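A minimal sketch, taking the event \(X > 1\) under a standard normal \(v\) (an arbitrary illustrative choice):

```python
import numpy as np
from math import erf, exp, sqrt

rng = np.random.default_rng(0)

def empirical_measure(f, samples):
    # v_n(f | D) = (1/n) * sum_i I_f(X_i)
    return float(np.mean([f(x) for x in samples]))

f = lambda x: x > 1.0                        # the event {X > 1}
true_vf = 1 - 0.5 * (1 + erf(1 / sqrt(2)))   # v(f) = P(X > 1) under N(0,1), ~0.1587

eps = 0.05
for n in (100, 1000, 10000):
    vn = empirical_measure(f, rng.standard_normal(n))
    hoeffding = 2 * exp(-2 * n * eps ** 2)   # bound on Pr(|v(f) - v_n(f|D)| >= eps)
    print(n, vn, abs(vn - true_vf), hoeffding)
```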
Bound variability in estimate
From the central limit theorem, for large \(n\), \(v_n(f|D)\) is approximately distributed as \(N(v(f), \frac{\stddev}{\sqrt{n}})\) (the second argument being the standard deviation), where \(\stddev\) is the standard deviation of \(f(X)\).
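For example, this gives the familiar asymptotic confidence interval: with probability roughly 0.95, \(|v_n(f|D) - v(f)| \leq 1.96\,\stddev/\sqrt{n}\).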
Also, we can use: \(Pr_{D, D'}(|v_n(f|D) - v_n(f|D')|\geq \eps) \leq Pr_{D, D'}(|v_n(f|D) - v(f)|\geq \eps/2 \lor |v_n(f|D') - v(f)|\geq \eps/2) \leq 2Pr_D(|v_n(f|D) - v(f)| \geq \eps/2)\) (union bound) and use the Hoeffding inequality again.
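Concretely, assuming \(f\) takes values in \([0,1]\), the Hoeffding inequality gives \(Pr_D(|v_n(f|D) - v(f)| \geq \eps/2) \leq 2e^{-n\eps^{2}/2}\), so \(Pr_{D, D'}(|v_n(f|D) - v_n(f|D')| \geq \eps) \leq 4e^{-n\eps^{2}/2}\).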
Goodness of estimate for a class of events
Let \(F = \set{f}\) be a class of events (or binary functions) defined on the input space \(X\), on which \(v\) is a measure. Let \(E_{F}(m)\) be the maximum number of dichotomies \(F\) induces on \(m\) points: see boolean functions survey.
(Vapnik, Chervonenkis).
Then \(Pr(\sup_{f \in F} |v_n(f|D) - v(f)| > \eps) \leq 8 E_{F}(n)\exp(\frac{-n\eps^{2}}{32})\).
Proof
If we were to use the union bound and the Hoeffding inequality naively, we would have a factor of \(2|F| \gg 8E_F(n)\) on the RHS. So, we want to get to a point where we need only take the union bound over a small subset of \(F\).
So, first we show the following lemma: \(Pr_D(\sup_{f \in F} |v_n(f|D) - v(f)| > \eps) \leq 2Pr_{D, D'}(\sup_{f \in F} |v_n(f|D) - v_n(f|D')| > \eps/2)\), for a sample set \(D'\) acquired in the same way as \(D\).
Proof
\(Pr_{D, D'}(\sup_{f \in F} |v_n(f|D) - v_n(f|D')| > \eps/2) \geq Pr_{D}(\sup_{f \in F} |v_n(f|D) - v(f)| > \eps)\, Pr_{D'}(|v_n(f'|D') - v(f')| < \eps/2 \;|\; |v_n(f'|D) - v(f')| > \eps)\), where \(f'\) is a function attaining the sup for the given \(D\); the latter factor can be seen to be at least \(1/2\) (for large enough \(n\)) using the Chebyshev inequality, which yields the lemma. This is called the ‘ghost sample technique’.
So, now we need only bound \(Pr_{D, D'}(\sup_{f \in F} |v_n(f|D) - v_n(f|D')| > \eps/2)\). One way to deal with this is to take the union bound now over the set of all dichotomies induced over \(D \union D'\) to get the bound \(E_F(2n)\, Pr_{D, D'}(|v_n(f|D) - v_n(f|D')| > \eps/2)\), which can then be bounded as described elsewhere.
Sharper bound
The sharper bound we require can be obtained using a different analysis, involving \(E_F(n)\) instead. The process of picking \(D, D'\) and calculating \(n^{-1}|\sum_i (f(X_i) - f(X'_i))|\) is equivalent to the process of picking \(D, D'\), then picking \(n\) random signs \(s_i \in \set{\pm 1}\), and then calculating \(n^{-1}|\sum_i s_i(f(X_i) - f(X'_i))|\): in other words, we do this \(n\) times: pick a pair of points and assign them to \(D\) and \(D'\) at random. So, we bound \(Pr_{D, D', s}(\sup_{f \in F} n^{-1}|\sum_i s_i(f(X_i) - f(X'_i))| > \eps/2) \leq Pr_{D, D', s}(\sup_{f \in F} [|n^{-1}\sum_i s_i f(X_i)| \geq \eps/4 \lor |n^{-1}\sum_i s_i f(X'_i)| \geq \eps/4]) \leq 2Pr_{D, s}(\sup_{f \in F} |n^{-1}\sum_i s_i f(X_i)| \geq \eps/4)\). A union bound over the \(E_F(n)\) dichotomies induced on \(D\) and the Hoeffding inequality can now be applied to bound this.
Estimate CDF using empirical CDF
(Glivenko-Cantelli). Pick \(\set{Z_i} \distr F\), the CDF. Then \(Pr(\sup_z |F_n(z) - F(z)| > \eps) \leq 8(n+1)\exp(\frac{-n\eps^{2}}{32})\). Pf: apply the Vapnik-Chervonenkis theorem above with half-lines \((-\infty, z]\) as classifiers, for which \(E_F(n) = n+1\).
This answers: what \(n\) do you need to achieve low error? Thence, by Borel-Cantelli, \(\lim_{n \to \infty} \sup_z |F_n(z) - F(z)| = 0\) with probability 1.
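A minimal sketch of empirical-CDF convergence, assuming the true \(F\) is the standard normal CDF (an arbitrary illustrative choice):

```python
import numpy as np
from math import erf, sqrt, exp

rng = np.random.default_rng(0)

def normal_cdf(z):
    # CDF of N(0,1) at a scalar z.
    return 0.5 * (1 + erf(z / sqrt(2)))

grid = np.linspace(-4, 4, 2001)
true_F = np.array([normal_cdf(z) for z in grid])

def sup_deviation(n):
    samples = np.sort(rng.standard_normal(n))
    # Empirical CDF: F_n(t) = (# samples <= t) / n, evaluated on the grid.
    F_n = np.searchsorted(samples, grid, side="right") / n
    return np.max(np.abs(F_n - true_F))

eps = 0.05
for n in (100, 1000, 10000):
    # The VC-type bound above; it is loose (vacuous unless n is very large),
    # but the observed sup deviation does shrink with n.
    bound = 8 * (n + 1) * exp(-n * eps ** 2 / 32)
    print(n, sup_deviation(n), min(bound, 1.0))
```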