PAC learning

The goal

Distribution and the oracle

Distribution agnostic. Assumption: training distribution = testing distribution. Two-oracle PAC model variant: the algorithm can use both D^+_c and D^-_c, and must achieve error ≤ ϵ wrt both.

EQ oracle can be simulated by the EX oracle.

Strong PAC learning

Given ϵ, δ, C, we want to find h ∈ C such that, with probability ≥ 1 − δ, Pr(c(x) ≠ h(x)) ≤ ϵ.

Weak PAC learning algorithm

p, q polynomials; ϵ ≤ 1/2 − g = 1/2 − 1/p(n, size(c)), δ ≤ 1 − 1/q(n, size(c)). g: the advantage of L over random guessing. Minimally better than random guessing: a subexponentially small advantage g can be achieved just by memorizing previously seen examples, which is why g is required to be inverse-polynomial.

h is a weak hypothesis for c iff Pr_{x∼D}[h(x)c(x) = 1] ≥ 1/2 + g; equivalently, E_{x∼D}[h(x)c(x)] ≥ 2g.

PAC learning model

Make PAC learning algorithm

For learning C using H (containing C): a multivalued discrete classification function (finitely many label values) can be cast into the PAC model by making a binary classifier concept for each bit of the label.

Decide an initial h (maybe the null or the universal concept); decide how to update the hypothesis on each example provided by the sampling oracle; check whether positive examples alone are enough.

If H is finite: create an Occam algorithm (find h consistent with the sample S); maybe use size(h) to bound |H|; apply Occam's razor. Alternatively, if H is finite: create an "Occamish" algorithm whose h is inconsistent with at most ϵ|S|/2 examples; maybe use size(h) to bound |H|; use Chernoff to bound Pr(some bad h has |[c Δ h] ∩ S| ≤ mϵ/2); then continue as in the Occam razor proof.
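
A minimal sketch of the resulting sample-size computation for the consistent, finite-|H| case; the function name and the conjunctions example are illustrative:

```python
import math

def occam_sample_size(h_class_size, eps, delta):
    """Occam razor bound for a finite hypothesis class: if h is consistent with
    m >= (1/eps)(ln|H| + ln(1/delta)) examples, then Pr[error(h) > eps] <= delta."""
    return math.ceil((math.log(h_class_size) + math.log(1.0 / delta)) / eps)

# e.g. conjunctions over n = 20 boolean variables: |H| <= 3^20
print(occam_sample_size(3 ** 20, eps=0.05, delta=0.01))
```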

Try to find an algorithm which grows the hypothesis decision list efficiently, as if greedily solving a set cover problem (sketch below).
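
A rough sketch of such a greedy learner for 1-decision lists, under the assumption that examples are (x, y) pairs with x a 0/1 attribute tuple (the interface is illustrative): at each step it picks a "pure" literal covering the most remaining examples, exactly the greedy set-cover heuristic.

```python
def learn_decision_list(examples):
    """Greedy 1-decision-list learner sketch. examples: list of (x, y) with
    x a tuple of 0/1 attributes and y in {0, 1}."""
    remaining = list(examples)
    n = len(examples[0][0])
    rules = []                                   # list of ((index, value), label)
    while remaining:
        best = None
        for i in range(n):
            for v in (0, 1):
                covered = [(x, y) for x, y in remaining if x[i] == v]
                labels = {y for _, y in covered}
                if covered and len(labels) == 1:          # "pure" literal
                    if best is None or len(covered) > best[2]:
                        best = ((i, v), labels.pop(), len(covered))
        if best is None:
            return None                          # no consistent 1-decision list
        (i, v), label, _ = best
        rules.append(((i, v), label))
        remaining = [(x, y) for x, y in remaining if x[i] != v]
    default = rules[-1][1] if rules else 0

    def h(x):
        for (i, v), label in rules:
            if x[i] == v:
                return label
        return default
    return h
```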

For any |H|: make an algorithm to find h consistent with S; use the VC-dimension version of the Occam razor bound.

Make an algorithm to learn C using only EQ(c), simulate EQ using EX.

PAC reduce C over X to C’ over X'

Efficiently map x ∈ X_n to g(x) ∈ X'_{p(n)}, so that for every c there exists an image concept c' with c'(g(x)) = c(x) and size(c') ≤ q(size(c)).

Prove efficient PAC learnability

Make a PAC algorithm, bound m for an (ϵ, δ) approximation of c; prove that the running time is polynomial in (ϵ^{-1}, δ^{-1}, size(c), m, n).

Reduce to a known efficient PAC learning algorithm.

Show efficient learnability in the mistake bound model; bound m using Occam razor or VCD Occam razor.

Find weak learner; boost.

If learning on uniform distribution, see U distribution Fourier analysis section.

Lower bound on sample complexity in PAC

Shatter dimension

m = Ω(d/ϵ).

\pf Take S shattered by C, let |S| = d and let the distribution D = U over S. Then, we show that no algorithm L can ensure ϵ error whp. As S is the support of D, without loss of generality we can take C to be the finite set of 2^d concepts inducing the different dichotomies on S.

Draw a sample S' with |S'| = d/2; suppose that L learns h = f_L(S'). But there are |2^{S∖S'}| = 2^{d/2} different concepts c consistent with the labelling of S', and h is chosen independently of which of these is the target. Does h have low error wrt all of them?

We want to show that there exists a c for which the learned h has a high error rate with non-negligible probability. We do this by considering the following process: fix S' and h, pick c ∼ U(2^{S∖S'}). We then show that E_c[Pr_x(h(x) ≠ c(x))] ≥ 1/4, which implies the existence of a c for which h is a bad approximation.

E_c[Pr_x[h(x) ≠ c(x)]] = E_c[Pr_D(x ∈ S∖S') · Pr_c(h(x) ≠ c(x) | x ∈ S∖S')] ≥ (1/2) × (1/2). Thus, there exists a c for which the h learned using S' as the sample is bad.

We then need to show that L cannot find a good hypothesis for this c with high probability. \tbc

From halving algorithm

The halving algorithm's mistakes in the MB model give an approximate lower bound: take the set S of examples on which L makes mistakes, put the uniform distribution on S, and run the halving algorithm in the PAC model with m samples; the error probability is then at least (|S| − m)/2^{log|S|} = (|S| − m)/|S|, so for error ≤ ϵ = n^{−k} we need |S|(1 − 1/n^k) ≤ m.

PAC with mq

Any R efficiently PAC learnable using mq's and eq's is efficiently PAC learnable using mq's alone (with EX), since EQ's can be simulated by the EX oracle.

PAC learnability of some R

Also see list of R which are PAC learnable with noise, learnable in MB.

Halfspaces

f = sgn(w·x) with integer weights w (a constant threshold coordinate x_0 = 1 included); letting W = Σ_i|w_i|, there is some coordinate x_i with |E_{x∼D}[f(x)x_i]| ≥ W^{-1}; so some x_i (or its negation) is a weak hypothesis; then use AdaBoost.
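
A rough sketch of that single-coordinate weak learner, assuming ±1-valued features and labels, a constant coordinate x_0 = 1 appended to each x, and a weighted sample (so it can be plugged into a booster); the interface is illustrative:

```python
def best_coordinate_weak_learner(sample):
    """Pick the coordinate (or its negation) with the largest empirical
    correlation with the label. sample: list of (x, y, weight) with
    x in {+-1}^n (including a constant coordinate), y in {+-1}."""
    n = len(sample[0][0])
    total = sum(w for _, _, w in sample)
    best_i, best_corr = 0, 0.0
    for i in range(n):
        corr = sum(w * y * x[i] for x, y, w in sample) / total
        if abs(corr) > abs(best_corr):
            best_i, best_corr = i, corr
    sign = 1 if best_corr >= 0 else -1
    return lambda x: sign * x[best_i]
```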

Parities

Use the MB algorithm with Occam razor.

ae learning of parities wrt Uniform distribution is equivalent to decoding high rate random linear codes from low number of errors. So, believed to be hard.

Approximate f with orthogonal polynomials

Aka Fourier coefficients. Must have orthogonal polynomials for D.

Uniform distribution PAC learning (UPAC)

Easier than learning under general distributions: can use Fourier analysis.

Harder than learning under the Gaussian distribution. \why

\oprob What is the lower bound for UPAC learning? Does ae learning of DLs make sense?

Shortness of DNF term length, Decn tree depth

DNF term-length or 1-DL size can be limited to log(1/ϵ): the probability of a longer term being satisfied is < ϵ.

Learn DNF

Collect all terms of size ≤ log(s/ϵ) consistent with the target DNF: use feature expansion and the disjunction-learning Winnow algorithm (sketch below).
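
A rough sketch of the feature-expansion-plus-Winnow approach, assuming 0/1 attribute vectors and a stream of labelled examples; the parameter choices and interface are illustrative:

```python
from itertools import combinations, product

def expand(x, k):
    """Meta-features: one per conjunction of at most k literals over x in {0,1}^n;
    the feature is 1 iff x satisfies that conjunction."""
    n = len(x)
    feats = [1]                                        # empty conjunction
    for size in range(1, k + 1):
        for idxs in combinations(range(n), size):
            for signs in product((0, 1), repeat=size):
                feats.append(int(all(x[i] == s for i, s in zip(idxs, signs))))
    return feats

def winnow_predict(w, feats, theta):
    return int(sum(wi * fi for wi, fi in zip(w, feats)) >= theta)

def winnow_train(stream, k, alpha=2.0):
    """Winnow over the expanded features learns the disjunction of the short
    terms of the target DNF; mistake bound O(#terms * log #features)."""
    w, theta = None, None
    for x, y in stream:                                 # y in {0, 1}
        feats = expand(x, k)
        if w is None:
            w, theta = [1.0] * len(feats), len(feats) / 2
        yhat = winnow_predict(w, feats, theta)
        if yhat != y:
            factor = alpha if y == 1 else 1.0 / alpha   # promote or demote
            w = [wi * (factor if fi else 1.0) for wi, fi in zip(w, feats)]
    return lambda x: winnow_predict(w, expand(x, k), theta)
```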

The function space

Input space: {±1}^n. Basis of the function space: parity functions p_S. f = Σ_{S⊆[n]} f^(S) p_S. See Boolean function ref for details.

Alg to learn Fourier coefficient

Learn f^(S) using f^(S) = E_{x∼U({±1}^n)}[f(x)p_S(x)]: choose a sample of size m; by Chernoff, m = Ω(l^{-2} log δ^{-1}) suffices for a (δ, l) approximation of f^(S).
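
A minimal sketch of that sampling estimator, assuming f is a callable on {±1}^n tuples:

```python
import random

def estimate_fourier_coeff(f, n, S, m):
    """Estimate f^(S) = E_{x~U({+-1}^n)}[f(x) p_S(x)] from m uniform samples.
    By Chernoff, m = O(log(1/delta) / l^2) gives accuracy l with prob 1 - delta."""
    total = 0.0
    for _ in range(m):
        x = tuple(random.choice((-1, 1)) for _ in range(n))
        p_S = 1
        for i in S:
            p_S *= x[i]
        total += f(x) * p_S
    return total / m

# Example: f = parity on coordinates {0, 2}; its coefficient on S = {0, 2} is 1.
print(estimate_fourier_coeff(lambda x: x[0] * x[2], n=5, S={0, 2}, m=2000))
```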

Best fitting polynomial g of degree d: g = Σ_{|S|≤d} f^(S) p_S (see Boolean function ref). Pr_x(sgn(g(x)) ≠ f(x)) = E_x[[sgn(g(x)) ≠ f(x)]] ≤ E_x[(f(x) − g(x))^2].

Low degree algorithm

If f's Fourier spectrum is ϵ-concentrated on degree ≤ d, then f is learnable wrt U to accuracy 1 − O(ϵ) in time n^{O(d)}: sample to estimate f^(S) for all |S| ≤ d, form g, output sgn(g) (sketch below). Also used for agnostically learning well-correlated parities.
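
A minimal sketch of the low-degree algorithm under these assumptions (d is the degree bound; only uniform random labelled examples are used, no membership queries):

```python
import random
from itertools import combinations

def low_degree_learn(f, n, d, m):
    """Estimate every f^(S) with |S| <= d from m random examples,
    form g = sum_S f^(S) p_S, and output sgn(g). Time n^{O(d)}."""
    sample = []
    for _ in range(m):
        x = tuple(random.choice((-1, 1)) for _ in range(n))
        sample.append((x, f(x)))

    def p(S, x):
        out = 1
        for i in S:
            out *= x[i]
        return out

    coeffs = {}
    for k in range(d + 1):
        for S in combinations(range(n), k):
            coeffs[S] = sum(y * p(S, x) for x, y in sample) / m

    def h(x):
        g = sum(c * p(S, x) for S, c in coeffs.items())
        return 1 if g >= 0 else -1
    return h

# Example: a degree-2 target learned from random examples only.
target = lambda x: x[0] * x[1]
h = low_degree_learn(target, n=6, d=2, m=3000)
```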

Learning some R under U

Poly-size DNF formulae learnable in n^{O(log|c| · log(1/ϵ))}. Halfspaces learnable in n^{O(ϵ^{-2})}. Any function of k halfspaces learnable in n^{O(k^2 ϵ^{-2})}. \oprob Do better for intersections of k halfspaces.

Depth d ckts of size |c|. Also Decision trees. \why

Based on total influence

f with total influence I(f) is learnable in n^{O(I(f)/ϵ)} to accuracy 1 − ϵ. \why

So, monotone f (total influence O(√n)) is learnable in n^{O(√n/ϵ)}.

Hardness of ae learning of PARITY(k)

This is equivalent to decoding a high-rate random linear code C from a small number of errors, a hard open problem. Also equivalent to learning PARITY with noise, in the simpler case where the noise is random and independent.

If you can decode C, you can ae learn PARITY: simply take the examples {x_i} you get, stack them into the check matrix H, get a random generator matrix G from null(H^T), give it to the decoder along with the syndrome {p_e(x_i)}, and return the recovered error vector e as the hypothesis parity p_e.

If you can learn PARITY(k) using algorithm L, you can learn it properly whp: take the hypothesis h returned by L and turn it into an mq oracle that evaluates the parity at y by taking the majority vote of h(x⊕y) + h(x) over many random x. Thence recover the hypothesis parity p attribute-efficiently (sketch below).
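
A minimal sketch of that self-correction step, assuming h is a 0/1-valued hypothesis over GF(2)^n that agrees with some parity p on most points:

```python
import random

def self_correct(h, n, y, trials=51):
    """Self-correction for parities: since p(y) = p(x XOR y) XOR p(x) for every x,
    evaluate h(x XOR y) XOR h(x) for many random x and take the majority vote.
    If h agrees with p on a 1 - eta fraction of points, each vote is correct
    with probability >= 1 - 2*eta."""
    votes = 0
    for _ in range(trials):
        x = [random.randint(0, 1) for _ in range(n)]
        xy = [xi ^ yi for xi, yi in zip(x, y)]
        votes += h(xy) ^ h(x)
    return int(votes > trials / 2)
```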

By properly ae learning parity you can decode a random linear code: given G and y, form H and the syndrome yH, and use the learner.

UPAC with mq

Learning PARITY(k) using namq

Let H be the check matrix of a code C; the corrupted codeword is y = c + e for some codeword c and error e with wt(e) < w. Take a BCR decoding algorithm (see coding theory ref) A which accepts y and the syndrome yH = eH, and returns e. So, using namq's on the columns of H, A learns an unknown e ∈ PARITY(k) attribute-efficiently. Another view: with e acting as noise, A learns the parity c using noisy namq's.

Sparse approach

For t-sparse f, or for f having a t-sparse g with E_x[(f(x) − g(x))^2] ≤ ϵ: such a g is an approximator for f.

Thus, using the K-M algorithm, f is learnable to accuracy 1 − ϵ in time poly(n, ‖f^‖_1, ϵ^{-1}) (where ‖f^‖_1 = Σ_S|f^(S)| is the spectral norm) using membership queries: find the coefficients of g = the coefficients of f of magnitude ≥ ϵ/‖f^‖_1; use sgn(g) as the hypothesis.

Weak parity learning

Given f with some t-heavy coefficient and an MQ oracle, find a vector z with |f^(z)| ≥ t/2. |f^(y)| ≥ t iff Pr(f = ±p_y) ≥ 1/2 + t/2; so the weakly correlated parity p_y (or its negation) agrees with f on a 1/2 + t/2 fraction of points, hence "weak parity learning".

Can also view this as learning parities in agnostic setting. See agnostic learning section.

ae-namq algorithm: Like a sieve

Take an ae na-mq parity learning algorithm L. Define f_c(x) := f(x⊕c); let S be the set of mq points of L. Estimate the Fourier coefficients of f with t/4 accuracy. Also, for every y in S, find all Fourier coefficients of f_{R,y} to estimate the coefficients of f_y. f^_c(a) = f^(a)p_a(c); so f^_c(a)·f^(a) and p_a(c) have the same sign.

Run L for every z ∈ {0,1}^m with \hat{f}_R(z) ≥ 3t/4, using \hat{f}_{R,c}(z)\hat{f}_R(z) to answer the mq at c, and learn p_a. If a = zR, add p_a to a set of possible hypotheses. Repeat the process many times with different R. Really good p_a occur in the hypothesis sets at a much higher rate; thus, find a with |f^(a)| ≥ t/2.

Can also be adapted to find a t/2-heavy component of a randomized function ψ; only, the m required for good approximation whp depends on var(ψ).

K-M algorithm to find big components using membership queries

Let |a| = k. Define f_a(x) := Σ_{Z∈{±1}^{n−k}} f^(aZ) p_Z(x). Then f_a(x) = E_y[f(yx)p_a(y)]: write Z = Z_1Z_2 with Z_1 ∈ {±1}^k; E_y[f(yx)p_a(y)] = E_y[Σ_Z f^(Z)p_Z(yx)p_a(y)] = Σ_{Z_1,Z_2} f^(Z_1Z_2)p_{Z_2}(x)E_y[p_{Z_1}(y)p_a(y)] = Σ_{Z_2} f^(aZ_2)p_{Z_2}(x), since E_y[p_{Z_1}(y)p_a(y)] = [Z_1 = a].

(Kushilevitz, Mansour). Input t; output all S with |f^(S)| ≥ t whp. Make a tree: at the root estimate ‖f‖_2^2; at the left child ‖f_0‖_2^2; at the right child ‖f_1‖_2^2; prune any branch with ‖f_a‖_2^2 < t^2; recurse.

Leaves of the tree are individual coefficients f^(S). Height n. Breadth ≤ t^{-2}: every level is a partition of {f^(S)} (so by Parseval the squared norms at a level sum to ‖f‖_2^2 ≤ 1); hence at most t^{-2} of the {f_a} at any level have ‖f_a‖_2^2 > t^2.

To find ‖f_a‖_2^2 = E_x[f_a(x)^2]: estimate f_a(x) = E_y[f(yx)p_a(y)] by picking random y's and using membership queries (and random x's for the outer expectation).

Opcount: poly(n, t^{-1}).
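
A rough sketch of the K-M search under the above description; f is assumed to be a membership-query oracle on {±1}^n tuples, and the sample sizes m1, m2 and the t^2/2 pruning slack are illustrative choices:

```python
import random

def km_heavy_coefficients(f, n, t, m1=200, m2=200):
    """Find all S with |f^(S)| >= t (whp) using membership queries.
    a is a 0/1 prefix of length k; f_a(x) = E_y[f(y x) p_a(y)] and
    ||f_a||_2^2 = E_x[f_a(x)^2] are estimated by sampling."""
    def p(a, y):                         # parity indexed by the bit vector a
        out = 1
        for ai, yi in zip(a, y):
            if ai:
                out *= yi
        return out

    def f_a(a, x):                       # inner estimate, uses membership queries
        tot = 0.0
        for _ in range(m2):
            y = [random.choice((-1, 1)) for _ in range(len(a))]
            tot += f(tuple(y) + tuple(x)) * p(a, y)
        return tot / m2

    def norm_sq(a):                      # outer estimate of E_x[f_a(x)^2]
        tot = 0.0
        for _ in range(m1):
            x = [random.choice((-1, 1)) for _ in range(n - len(a))]
            tot += f_a(a, x) ** 2
        return tot / m1

    heavy, stack = [], [[]]
    while stack:
        a = stack.pop()
        if norm_sq(a) < t * t / 2:       # prune: no heavy coefficient below here
            continue
        if len(a) == n:
            heavy.append(a)              # a names a single coefficient f^(a)
        else:
            stack.extend([a + [0], a + [1]])
    return heavy

# Example: f is the parity x1*x3 on {+-1}^4; the only heavy coefficient is 1010.
print(km_heavy_coefficients(lambda x: x[0] * x[2], n=4, t=0.5))
```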

Results

Decision trees with t leaves approximable by t-sparse function; so learnable in polynomial time with membership queries.

Learning DNF

Aka the Harmonic sieve. DNFs with s terms have a (2s+1)^{-1}-heavy parity component: for any D, there is a p_a such that |E_D[f·p_a]| ≥ (2s+1)^{-1} and wt(a) ≤ log((2s+1)·2^n‖D‖_∞). \why

E_D[f(x)p_a(x)] = E_U[f(x)·2^nD(x)·p_a(x)] = ψ^(a), where ψ(x) := f(x)·2^nD(x). If we are given oracle access to D, we can evaluate ψ. So, use the ae-namq algorithm to learn an a with |ψ^(a)| ≥ t/2. As var(ψ) is controlled by 2^nD(x), the m required will depend on this; this is efficient only for D polynomially close to U.

Then boost using a p-smooth boosting algorithm (one whose reweighted distributions stay within a polynomial factor of U).

Boosting weak efficient PAC algorithm L

Practical applications

Frequently used: boost decision trees.

Boost confidence

Run L k times to gather k hypotheses (some run gives an ϵ-approximation with probability 1 − (1 − δ_0)^k); select the best h by using many fresh examples to gauge h Δ c (sketch below).
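
A minimal sketch of confidence boosting, assuming weak_pac_learner() runs one (possibly failing) training attempt and draw_example() returns a labelled pair; the names are illustrative:

```python
def boost_confidence(weak_pac_learner, draw_example, k, m_test):
    """Run the learner k times (each run succeeds with some probability delta_0,
    so some run succeeds with probability >= 1 - (1 - delta_0)^k), then pick the
    hypothesis with the lowest error on a fresh validation sample."""
    candidates = [weak_pac_learner() for _ in range(k)]
    validation = [draw_example() for _ in range(m_test)]

    def err(h):
        return sum(1 for x, y in validation if h(x) != y) / m_test
    return min(candidates, key=err)
```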

Boosting accuracy

You are using the guarantee that the polynomial-time L works under any distribution: by playing with the input distribution, you produce a hypothesis with very accurate predictions.

General idea for all techniques: put more weight on x on which weak hyp h makes mistake; rerun weak learner.

Lower bound

The number of iterations to get ϵ-accuracy using a g-advantage weak learner is Ω(g^{-2} log(1/ϵ)).

Properties of booster

Noise tolerance. Smoothness: how far do the artificially generated distributions deviate from the original? Affects speed.

Boost accuracy by filtering

Reweight points using online algorithm.

Majority tree of Hypotheses Booster

(Schapire). Error b boosted to g(b) = 3b^2 − 2b^3: run L to get h_1; filter D to get D_2 on which h_1 is useless (like random guessing); run L to get h_2, which must learn something new; filter D to get D_3 supported on {x : h_1(x) ≠ h_2(x)}; run L to get h_3; return majority(h_1, h_2, h_3). Gauge the errors of h_1 and h_2 to ensure efficient filterability. A sketch follows.
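
A rough sketch of the three-stage filtering, assuming callbacks weak_learn, draw and label; rejection sampling stands in for the filtered distributions, and the error gauging needed for efficient filterability is omitted:

```python
import random

def schapire_boost(weak_learn, draw, label, n_samples):
    """Three-hypothesis booster sketch. Assumed interfaces:
    weak_learn(list of (x, y)) -> hypothesis, draw() -> x, label(x) -> {0, 1}."""
    # h1: learn on the original distribution D.
    S1 = [draw() for _ in range(n_samples)]
    h1 = weak_learn([(x, label(x)) for x in S1])

    # h2: learn on D2, where h1 is no better than random guessing:
    # an equal mixture of {h1 correct} and {h1 wrong}, via rejection sampling.
    def draw_d2():
        want_correct = random.random() < 0.5
        while True:
            x = draw()
            if (h1(x) == label(x)) == want_correct:
                return x
    h2 = weak_learn([(x, label(x)) for x in (draw_d2() for _ in range(n_samples))])

    # h3: learn on D3 = D conditioned on h1(x) != h2(x).
    def draw_d3():
        while True:
            x = draw()
            if h1(x) != h2(x):
                return x
    h3 = weak_learn([(x, label(x)) for x in (draw_d3() for _ in range(n_samples))])

    # Final hypothesis: majority vote of the three.
    return lambda x: 1 if h1(x) + h2(x) + h3(x) >= 2 else 0
```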

Recursive accuracy boosting

For target error b and distribution D: invoke self thrice for error g^{-1}(b) and distributions D, D_2, D_3. The final h is like a 3-ary tree; only the leaves use samples for learning; the internal nodes sample only for error gauging.

Boost accuracy by sampling

Reweight points using offline algorithm.

Get a sample S; boost accuracy till h is consistent with S; bound the size of h; bound |S| using the Occam razor.

Adaptive booster (AdaBoost)

Get a sample S; use the uniform distribution on it. At time t = 0: the weight on s_j is w_j = 1; W_0 = Σw_i = m; the probability distribution is D(x_i) = w_i/W. On iteration i: run L, get hypothesis h_i: X → {±1} with err(h_i) = E_i, accuracy a_i = 1 − E_i, β_i = E_i/a_i; for each s_j where h_i is correct, reduce its weight: w_j := w_j·β_i. Output after time T: the weighted majority vote sgn(Σ_i α_i h_i) (equivalently, threshold at 1/2 for 0/1-valued h_i), where α_i = lg β_i / Σ_j lg β_j. If E = max_i E_i and β = max_i β_i with all rounds equal, this is just the simple majority sgn(Σ_i h_i). A sketch follows.
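
A minimal AdaBoost sketch matching the description above, written in the equivalent ±1-output form; the weak_learn interface is an assumption:

```python
import math

def adaboost(S, weak_learn, T):
    """S: list of (x, y) with y in {-1, +1}; weak_learn(S, D) returns a
    hypothesis h: x -> {-1, +1} with weighted error < 1/2 under distribution D."""
    m = len(S)
    w = [1.0] * m                                    # unnormalized weights
    hyps, alphas = [], []
    for _ in range(T):
        W = sum(w)
        D = [wi / W for wi in w]
        h = weak_learn(S, D)
        err = sum(D[j] for j, (x, y) in enumerate(S) if h(x) != y)
        err = max(err, 1e-12)                        # avoid division by zero
        beta = err / (1.0 - err)
        for j, (x, y) in enumerate(S):
            if h(x) == y:                            # shrink weights of correct points
                w[j] *= beta
        hyps.append(h)
        alphas.append(math.log(1.0 / beta))

    def H(x):                                        # weighted majority vote
        return 1 if sum(a * h(x) for a, h in zip(alphas, hyps)) >= 0 else -1
    return H
```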

Final error ϵ ≤ e^{−2Tg^2}: W_1 ≤ m(2E); W_T ≤ m(2E)^T (each round multiplies the total weight by E_t + β_t(1 − E_t) = 2E_t). The weight of an x_i misclassified by the final majority is ≥ β^{T/2}: at most T/2 of the {h_i} could have been right on it. Total weight on misclassified points, ϵ·m·β^{T/2} ≤ W_T ≤ m(2E)^T; so ϵ ≤ (4E^2/β)^{T/2} = (4E(1 − E))^{T/2} = (4(1/4 − g^2))^{T/2} ≤ exp(−2g^2T).

e^{−2Tg^2} < 1/m when T > ln m/(2g^2). So the final hypothesis size is |h| = O(ln m/(2g^2))·|c|.

Contrast with maj tree booster

Gives a simple hypothesis; takes advantage of the varying error rates of the weak hypotheses. \why

Noise tolerance properties

Sensitive to noise even if L is noise-resistant: the distributions are created by looking at the actual labels.

Boosting by branching programs

h = branching program: a DAG of hypotheses {h_{i,j}}, labelling x according to the leaf reached. Run the weak learner, get h_0. From the distributions D^−_{h_0} and D^+_{h_0}, get h_{1,1} and h_{1,2}. From D^−_{h_{i,1}} get h_{i+1,1}; from D^+_{h_{i,j}} and D^−_{h_{i,j+1}} get h_{i+1,j+1} (the diamonds in the DAG); from D^+_{h_{i,i+1}} get h_{i+1,i+2}.

Assume that both err_{D^+_c}(h_{i,j}) and err_{D^−_c}(h_{i,j}) are ≤ 1/2 − g (two-sided weak learning); this assumption can be removed. \why

At node (j,k), k − 1 hypotheses have said h(x) = 1 and j − k + 1 have said h(x) = 0. So, in the final level T, the first T/2 leaves are labelled 0 and the rest are labelled 1.

Take x with c(x) = 1, and let h_i = h_{i,j} be the hypothesis seen at level i. In the worst case, Pr[h_i(x) = 1] = 1/2 + g; so E[index of the leaf where x ends up] = T/2 + gT: a biased random walk.

Show Pr_x(h(x) = 1) ≥ 1 − ϵ: due to the lack of independence of the events h_i(x) = 1, a martingale is the right tool. Let X_i = [h_i(x) = 1] and Y = Σ_{i=1}^T X_i; Y_i = E[Y | X_1, ..., X_i] is a Doob martingale with |Y_i − Y_{i−1}| ≤ 1; Y_i = Σ_{j=1}^i X_j + (1/2 + g)(T − i). By Azuma: Pr(|Y − E[Y]| ≥ gT) ≤ 2e^{−g^2T/2} ≤ ϵ for T ≥ 10 log(1/ϵ)/g^2; so whp Y > T/2 and x reaches a leaf labelled 1.

Resistant to noise if L resistant to noise: Boosting program creates distributions without looking at c(x).

Consequences

PAC algorithm memory bound: O(1/ϵ). \why

Hardness of PAC learning

General techniques

Specify distr where learning hard.

Any C which includes functions which can be efficiently used to solve iterated products [ cryptography ref] is inherently unpredictable.

PAC-reduce a C’ known to be hard to C.

Show that the VCD is infinite. Eg: C = {convex polygons}.

Reduce a problem known to be hard in another area to learning C: eg, decoding random linear codes.

If RP ≠ NP

By contradiction: take an NP-complete language A; efficiently reduce an input a to a labelled sample set S_a such that S_a is consistent with some c ∈ C iff a ∈ A; assume an efficient PAC learning algorithm L; with a suitable ϵ (say ϵ < |S_a|^{-1}), run L (with a timer) to learn an ϵ-close h whp; check whether h is consistent with S_a; if yes, a ∈ A; otherwise a ∉ A: we have an RP algorithm for A!

Hardness results for proper learning

For k ≥ 3: k-term DNF learning is intractable when H is restricted to equal C, but predicting the classification accurately is tractable. Hardness is by reduction from the graph coloring problem \why; every k-term DNF is also a k-CNF by the distributive law, so improper learning using k-CNF as H is easy. Hence, converting the learnt k-CNF into a k-term DNF is the hard part.

Usually the RP ≠ NP assumption is used.

Cryptographic hardness results for improper learning

Aka inherent tractable-unpredictability: learning is hard even though polynomially many examples are information-theoretically sufficient and concept evaluation is easy.

C which include concepts which can find Discrete cube roots

Take n bit N’s. Make discrete cube root hardness assumption [see cryptography ref]. Consider uniform distr over ZN; any algorithm can generate examples by itself - so doesn’t learn anything new; otherwise contradicts assumption. Learning remains hard even if supplied n repeated squares of a number y mod N : no clue about d in this O(n2) input.

By PAC reduction, anything which can compute iterated products is hard to learn.

C which include concepts to find ith bit of BBS pseudorandom generator outputs

f_{s_0,N,t}(i): the ith output bit of the BBS pseudorandom generator with seed s_0 and modulus N, for i ≤ t (see randomized algs ref). If there is an O(n^{ck})-time algorithm A to learn C with error ≤ 1/2 − n^{−k}, you can distinguish BBS output from a random string b of n^{(d+1)ck} bits (d ∈ Z with dc > 1), and thence factor N.

Distinguisher D

Input: a string b; use the uniform distribution over the bit positions of b; use (i, b_i) as examples; learn from n^{ck} examples; get a hypothesis h with error ≤ 1/2 − n^{−k}; pick another bit index j; try predicting b_j; if the guess is correct, output 1 ("generator"). On a truly random b, Pr(D_rand = 1) ≤ 1/2 + n^{ck}·n^{−(d+1)ck}; but Pr(D_{f_{s_0,N,t}} = 1) ≥ 1/2 + n^{−k}: the difference is not negligible.

By PAC reduction, anything which can compute iterated products is hard to learn.

Classes which include RSA decryptors

C = {f_{d,N}}, with private key d and public key e: see cryptography ref. Not learnable in polytime: otherwise make examples over U({0,1}^n) of the form (x^e mod N, x); learning would violate the RSA unbreakability assumption.

Inherently tractably-unpredictable classes

Polynomial-sized boolean circuits of depth log n on n^2 inputs: iterated products can be solved by a circuit of depth log^2 n (using a binary tree of multiplication circuits); Beame et al: even with a log n deep circuit. So poly-sized Boolean formulae are inherently unpredictable.

Also, neural networks of constant depth with boolean weights (Reif). By PAC reduction from poly-sized Boolean formulae, log-space Turing machines and DFAs are inherently unpredictable (though DFAs are learnable with membership queries).

Intersections of halfspaces. \why Agnostically learning single halfspace. \why

\oprob DNF formulae (learnable under U if RP = NP): hard to learn? They seem too weak to encode pseudorandom generators and many cryptographic primitives; and if you unconditionally show DNF hard to learn under U, you show RP ≠ NP!