Learning the best hypothesis

Aka Agnostic learning model.

The goal

Let the error of the best fit be \(\eta = \min_{c \in C} Pr(c(x) \neq l)\). Given \(\del, \eps\): find \(h\) such that, with probability \(\geq 1 - \del\), \(Pr(h(x) \neq l) \leq \eta + \eps\).

Underlying distribution

Nothing is known to the algorithm about the process generating the data; we just ask for an approximately best fit, i.e., an \(\eps\)-approximation of the \(c \in C\) with error \(\eta\). Even the same example \(x\) can get different labels \(l\) at different times: this could be because of noise/corruption of the label, or inherent to the underlying distribution.

In some settings, it may be clear that the underlying \(Pr(x)\) distribution is of a particular type. This setting is important in practice.

Connection to PAC model

The PAC model is the special case \(\eta = 0\). So lower bounds that apply to PAC learning apply to agnostic learning too.

Common algorithms

Very few positive results are known. Usually, empirical risk minimization (ERM) is used: output the \(h \in H\) minimizing the empirical error on the sample.
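
A minimal sketch of ERM over a finite class (all names here are illustrative, not from a specific source); the standard bound is that \(O(\eps^{-2}(\ln|H| + \ln(1/\del)))\) samples suffice for ERM to achieve error \(\leq \eta + \eps\):

```python
import random

def erm(hypotheses, sample):
    """Empirical risk minimization: return the hypothesis in a finite class
    with the fewest disagreements on the labeled sample."""
    def empirical_error(h):
        return sum(1 for x, label in sample if h(x) != label)
    return min(hypotheses, key=empirical_error)

# Toy usage: hypotheses are the "dictator" functions x -> x[i] on {0,1}^3;
# labels are x[0] corrupted with probability 0.1.
random.seed(0)
hypotheses = [lambda x, i=i: x[i] for i in range(3)]
sample = [(x, x[0] if random.random() < 0.9 else 1 - x[0])
          for x in (tuple(random.randint(0, 1) for _ in range(3)) for _ in range(200))]
best = erm(hypotheses, sample)  # w.h.p. recovers the i = 0 dictator
```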

Under the uniform distribution, learn f with Fourier concentration

If the target \(f \in C\) has good Fourier concentration, \(\sum_{|S|\geq d} \hat{f}(S)^{2} \leq \eps\), and the best \(g \in H\) satisfies \(Pr_{x}(g(x)\neq f(x)) = \eta\): use the low degree algorithm to learn \(g\), getting \(h = \sum_{|S|< d} \hat{g}(S)p_{S}\). The error of \(h\) wrt \(g\) is the tail weight \(\norm{\sum_{|S|\geq d} \hat{g}(S)p_{S}}^{2} \leq \norm{g-\sum_{|S|<d}\hat{f}(S)p_{S}}^{2} = \norm{(g - f) + f_{\geq d}}^{2} \leq O(\eta + \eps)\): the first inequality because the degree-\(<d\) truncation of \(g\) is its best degree-\(<d\) \(L_2\) approximation, the last by the triangle inequality with \(\norm{g-f}^{2} = 4\eta\) (for \(\pm 1\)-valued functions) and \(\norm{f_{\geq d}}^{2} \leq \eps\). Since \(g\) and \(f\) differ with probability only \(\eta\), the final error of \(h\) wrt \(f\) is also \(O(\eta + \eps)\).
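
A sketch of the low degree algorithm under the uniform distribution, assuming \(\pm 1\) labels and the convention \(p_S(x) = (-1)^{\sum_{i \in S} x_i}\) (function names are made up):

```python
import itertools, random

def low_degree_learn(examples, n, d):
    """Low degree algorithm: estimate g^(S) = E[y * p_S(x)] for all |S| < d from
    uniform examples (x, y), y in {-1,+1}; predict with the sign of the truncation."""
    coeffs = {}
    for k in range(d):
        for S in itertools.combinations(range(n), k):
            coeffs[S] = sum(y * (-1) ** sum(x[i] for i in S)
                            for x, y in examples) / len(examples)
    return lambda x: 1 if sum(c * (-1) ** sum(x[i] for i in S)
                              for S, c in coeffs.items()) >= 0 else -1

# Toy usage: the target is the degree-1 parity on coordinate 0 of {0,1}^4.
random.seed(0)
examples = [(x, (-1) ** x[0]) for x in
            (tuple(random.randint(0, 1) for _ in range(4)) for _ in range(2000))]
h = low_degree_learn(examples, n=4, d=2)
assert h((1, 0, 0, 0)) == -1 and h((0, 1, 1, 0)) == 1
```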

Learning parities

Equivalent to decoding random linear codes from adversarial errors: encode \(x\) as \(xG = c\); take \(H\) whose columns span the null space of \(G\) (so \(GH = 0\)); then \((c+e)H = eH\). Finding the error pattern \(e\) from the syndrome \(eH\) is equivalent to learning noisy parities. For adversarial errors this is NP-hard.
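
A toy GF(2) check of the \((c+e)H = eH\) identity (brute-force null space, small parameters; purely illustrative):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
k, n = 3, 6
G = rng.integers(0, 2, size=(k, n))           # generator matrix, k x n over GF(2)

# Brute-force vectors spanning the right null space of G: all v with Gv = 0.
null_vecs = [np.array(v) for v in itertools.product([0, 1], repeat=n)
             if not (G @ np.array(v) % 2).any()]
H = np.array(null_vecs).T                     # columns satisfy GH = 0 (mod 2)

x = rng.integers(0, 2, size=k)                # message
c = x @ G % 2                                 # codeword
e = rng.integers(0, 2, size=n)                # adversarial error pattern
# Since cH = xGH = 0, the syndrome of the corrupted word depends only on e:
assert (((c + e) @ H) % 2 == (e @ H) % 2).all()
```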

If learning noisy parities is possible, it enables weak parity learning, i.e., finding a heavy Fourier coefficient of an arbitrary \(f\); see the reduction below.

Reduces to learning with random noise

Let \(f\) have a \(t\)-heavy component \(s\), i.e. \(|\hat{f}(s)| \geq t\); the goal is to find \(s\). The resulting algorithm is a weak parity learner.

Define the \(A\)-projection of \(f\), for \(A \in \set{0,1}^{m \times n}\): \(f_A = \sum_{a \in \set{0, 1}^n: Aa = 1^m} \hat{f}(a)p_a(x)\), i.e., keeping \(2^{n-m}\) of the basis parities. Then \(f_A(x) = E_{p \in \set{0,1}^m}[f(x \xor A^T p)p_{1^m}(p)]\). Thus one can simulate \(PEX(f_A)\) given \(PEX(f)\) (here \(PEX(g)\) returns \((x, b)\) with \(x\) uniform and \(b \in \set{-1,1}\), \(E[b \mid x] = g(x)\)).
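
A sketch of the simulation, assuming the \(PEX\) interface above (names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 3

def pex(f):
    """Assumed PEX(f) oracle, f ranged in [-1,1]: uniform x in {0,1}^n and a
    label b in {-1,+1} with E[b | x] = f(x)."""
    def sample():
        x = rng.integers(0, 2, size=n)
        return x, (1 if rng.random() < (1 + f(x)) / 2 else -1)
    return sample

def pex_projection(pex_f, A):
    """Simulate PEX(f_A) from PEX(f): draw (y, b) from PEX(f), pick uniform
    p in {0,1}^m, output x = y xor A^T p labeled b * p_{1^m}(p).
    Then E[label | x] = E_p[f(x xor A^T p) * (-1)^{sum p}] = f_A(x)."""
    def sample():
        y, b = pex_f()
        p = rng.integers(0, 2, size=m)
        return (y + A.T @ p) % 2, b * (-1) ** int(p.sum())
    return sample

# Toy usage: f is the parity on coordinate 0; A a uniformly random m x n matrix.
f = lambda x: (-1) ** int(x[0])
A = rng.integers(0, 2, size=(m, n))
x, b = pex_projection(pex(f), A)()
```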

Take \(f\) with range \([-1, 1]\), \(s \neq 0\), and a uniformly random \(A\): with probability \(\geq 2^{-m-1}\), both \(\hat{f_A}(s) = \hat{f}(s)\) and \(\sum_{a \neq s}\hat{f_A}(a)^2 \leq 2^{-m+1} \norm{f}^2\); the latter by computing the expectation and applying Markov's inequality.
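
Sketch of the calculation: \(Pr_A(As = 1^m) = 2^{-m}\) for \(s \neq 0\); conditioned on this, any fixed \(a \notin \set{0, s}\) is linearly independent of \(s\), so it still satisfies \(Aa = 1^m\) with probability \(2^{-m}\), giving \(E[\sum_{a \neq s: Aa = 1^m} \hat{f}(a)^2 \mid As = 1^m] \leq 2^{-m}\norm{f}^2\); by Markov, the conditional probability that this sum exceeds \(2^{-m+1}\norm{f}^2\) is \(\leq 1/2\), so both events hold with probability \(\geq 2^{-m} \cdot \frac{1}{2} = 2^{-m-1}\).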

Algorithm WP: take a learner \(L\) for parities over \(k\) variables that works for every noise rate \(\eta < 1/2\) in time \(T(n, k, (1-2\eta)^{-1})\) using \(S = S(n, k, (1-2\eta)^{-1})\) examples. Given \(PEX(f)\): simulate \(PEX(f_A)\), which is \(\approx PEX(\hat{f}(s)p_s(x))\), i.e., the parity \(s\) with noise rate \(\eta = \frac{1-t}{2}\); run \(L\) to get a parity \(s\); output \(s\) if \(s\) is \(t/2\)-heavy.
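
The final \(t/2\)-heaviness check is just an empirical estimate of \(\hat{f}(s) = E[b\, p_s(x)]\); a sketch (assumed \(PEX\) interface as above):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8

def is_heavy(pex_f, s, t, samples=2000):
    """Check |f^(s)| >= t/2 by estimating f^(s) = E[b * p_s(x)] from PEX
    examples (x, b); O(1/t^2 * log(1/delta)) samples suffice by Chernoff."""
    est = sum(b * (-1) ** int((x * s).sum() % 2)
              for x, b in (pex_f() for _ in range(samples))) / samples
    return abs(est) >= t / 2

# Toy usage: a noiseless PEX oracle for f = p_s itself, so f^(s) = 1.
s = rng.integers(0, 2, size=n)
def pex_f():
    x = rng.integers(0, 2, size=n)
    return x, (-1) ** int((x * s).sum() % 2)
assert is_heavy(pex_f, s, t=0.5)
```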

Per-example statistical distance \(\Del(PEX(f_A), PEX(\hat{f}(s)p_s(x))) \leq 1/(2S)\); \(L\) uses \(S\) examples; so the probability that \(L\) notices the difference is at most \(S \cdot \frac{1}{2S} = 1/2\).

Also, \(f\) has at most \(u = 1/t^2\) \(t\)-heavy coefficients (by Parseval); so, as in coupon collection, with \(O(u \log u)\) repetitions, identify all \(t\)-heavy coefficients.

Can learn a \(t\)-heavy \(p_s\) for \(f\) with range beyond \([-1,1]\), given an oracle \(EX(f)\) whose labels have the correct expected value \(f(x)\): scale \(f\) by \(\norm{f}_{\infty}\); learn a \(\frac{t}{\norm{f}_{\infty}}\)-correlated parity.

Learn DNF, k-juntas

For DNF, use a p-smooth booster. For the distributions \(D\) generated by the booster: learning a parity \(p_s\) correlated with \(f\) under \(D\) is the same as learning a parity correlated with \(2^{n}D(x)f(x)\) under the uniform distribution; scale this: \(F(x) = \frac{2^{n}D(x)f(x)}{\norm{2^{n}D(x)}_{\infty}}\); use the oracle \(EX_{D}(f)\), adding noise to get \(PEX(F)\).
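
Why the two are the same: \(E_{x \sim D}[f(x)p_s(x)] = \sum_x D(x)f(x)p_s(x) = E_{x \sim U}[2^{n}D(x)f(x)p_s(x)]\), i.e., the correlation of \(p_s\) with \(2^{n}Df\) under the uniform distribution.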

Learning k-juntas: a clean formulation of the problem of efficient learning in the presence of irrelevant attributes. A k-junta has at most \(2^k\) nonzero Fourier coefficients, and each of them is \(2^{-k+1}\)-heavy (see boolean fn ref); use WP to learn the full spectrum.
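
Why \(2^{-k+1}\): over the \(k\) relevant variables, \(\hat{f}(S) = 2^{-k}\sum_{x \in \set{0,1}^{k}} f(x)p_S(x)\); the sum has \(2^k\) terms, each \(\pm 1\), so it is even, and any nonzero coefficient has magnitude at least \(2 \cdot 2^{-k} = 2^{-k+1}\).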

Noise tolerance: an oracle for \(f\) with random classification noise \(\eta\) is the same as \(PEX((1-2\eta)f)\), since \(E[b \mid x] = (1-\eta)f(x) - \eta f(x) = (1-2\eta)f(x)\). Using this oracle in the above algorithm gives the oracle \(PEX((1-2\eta)f_A)\); use it as above.

Useful relations

\((1-x) \leq e^{-x}\). To solve \(am^{b} \leq 2^{cm}\) for \(m\): make both sides look alike. To simplify \(f(x)-f(x+t)\): by the mean value theorem, it equals \(-tf'(\xi)\) for some \(\xi\) between \(x\) and \(x+t\).