The goal
Distribution and the oracle
Distribution agnostic; the only assumption is that the training distribution equals the testing distribution. Two-oracle PAC model variant: the algorithm can use both \(D^{+}_{c}\) and \(D^{-}_{c}\).
EQ oracle can be simulated by the EX oracle.
Strong PAC learning
Given
Weak PAC learning algorithm
p, q polynomials; with probability at least \(1/q(n)\), the output h has error at most \(1/2 - 1/p(n)\);
such an h is a weak hypothesis for c under D.
PAC learning model
Make PAC learning algorithm
For C using H (containing C) with
Decide the initial h (maybe null, maybe universal); decide how to update the hypothesis in response to each example provided by the sampling oracle; check whether positive examples alone are enough.
If H is finite: create an Occam algorithm; maybe use size(h) to bound the description length of the output hypothesis.
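For reference, the standard consistency bound for finite H (the usual statement, recorded here as a reminder rather than taken from these notes): any \(h \in H\) consistent with m examples has error at most \(\epsilon\), with probability at least \(1-\delta\), provided
\[
m \;\ge\; \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right).
\]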
Try to find an algorithm which grows the hypothesis (e.g. a decision list) efficiently, as if solving a set-cover problem.
Any
Make an algorithm to learn C using only EQ(c); then simulate each EQ using EX (see the sketch below).
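A minimal sketch of the standard EQ-by-EX simulation; the oracle interface ex_oracle (returning a labelled pair drawn from D) and the per-query index are assumptions made for illustration:

\begin{verbatim}
import math

def simulate_eq_with_ex(h, ex_oracle, eps, delta, query_index):
    """Simulate the i-th equivalence query EQ(h) with the example oracle EX.
    Draw enough labelled examples that, if error(h) > eps, a counterexample
    is found with probability >= 1 - delta/2^i (union bound over queries).
    Returns a counterexample (x, c(x)), or None if h looks good enough."""
    i = query_index
    m = math.ceil((1.0 / eps) * (i * math.log(2) + math.log(1.0 / delta)))
    for _ in range(m):
        x, y = ex_oracle()          # labelled example (x, c(x)) drawn from D
        if h(x) != y:
            return (x, y)
    return None
\end{verbatim}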
PAC reduce C over X to C’ over X'
Efficiently map each instance \(x \in X\) to an instance \(x' \in X'\) and each concept \(c \in C\) to a concept \(c' \in C'\) so that \(c(x) = c'(x')\).
Prove efficient PAC learnability
Make a PAC algorithm directly; bound its sample complexity and running time.
Reduce to a known efficient PAC learning algorithm.
Show efficient learnability in the mistake bound model; bound the number of mistakes (an MB algorithm converts to a PAC algorithm).
Find weak learner; boost.
If learning under the uniform distribution, see the uniform-distribution Fourier analysis section.
Lower bound on sample complexity in PAC
Shatter dimension
m = \(\Omega\!\left(\frac{\mathrm{VCD}(C) + \log(1/\delta)}{\epsilon}\right)\)
\pf Take
Draw
We want to show that there exists a
We then need to show that
From halving algorithm
Halving algorithm
PAC with mq
Any R efficiently PAC learnable using mq’s and eq’s is efficiently learnable with mq’s only.
PAC learnability of some R
Also see list of R which are PAC learnable with noise, learnable in MB.
Halfspaces
Parities
Use the MB algorithm (Gaussian elimination over GF(2)) with Occam's razor; see the sketch below.
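A minimal sketch of the consistent-hypothesis step for parities, via Gauss-Jordan elimination over GF(2); the encoding of examples as 0/1 tuples is an assumption made for illustration:

\begin{verbatim}
def consistent_parity(examples, n):
    """Find a parity vector a in {0,1}^n with <a, x> = b (mod 2) for all
    labelled examples (x, b).  Returns one consistent a (free variables set
    to 0), or None if no parity is consistent with the sample."""
    rows = [list(x) + [b] for x, b in examples]   # augmented matrix [X | b]
    pivot_cols, r = [], 0
    for col in range(n):
        piv = next((i for i in range(r, len(rows)) if rows[i][col]), None)
        if piv is None:
            continue
        rows[r], rows[piv] = rows[piv], rows[r]
        for i in range(len(rows)):                # clear this column elsewhere
            if i != r and rows[i][col]:
                rows[i] = [u ^ v for u, v in zip(rows[i], rows[r])]
        pivot_cols.append(col)
        r += 1
    if any(not any(row[:n]) and row[n] for row in rows):
        return None                               # a "0 = 1" row: inconsistent
    a = [0] * n
    for i, col in enumerate(pivot_cols):
        a[col] = rows[i][n]
    return a
\end{verbatim}

Any parity returned this way is a consistent (Occam) hypothesis; in the mistake-bound view one re-solves whenever a prediction is wrong.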
ae (attribute-efficient) learning of parities w.r.t. the uniform distribution is equivalent to decoding high-rate random linear codes from a low number of errors, and so is believed to be hard.
Approximate f with orthogonal polynomials
Aka Fourier coefficients. We need a family of polynomials orthogonal w.r.t. D.
Uniform distribution PAC learning (UPAC)
Easier than learning under general distributions: can use Fourier analysis.
Harder than learning under the Gaussian distribution. \why
\oprob What is the lower bound for UPAC learning? Does ae learning D
Shortness of DNF term length, decision tree depth
DNF term-length or 1-D
Learn DNF
Collect all terms of size
The function space
Basis of the function space: the parity functions \(\chi_S(x) = (-1)^{\sum_{i \in S} x_i}\).
Algorithm to learn Fourier coefficients
Learn
Best-fitting polynomial g of degree d (in squared error): the truncated expansion \(g = \sum_{|S| \le d} \hat{f}(S)\chi_S\).
Low degree algorithm
If f is
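A minimal sketch of the low-degree algorithm under U; the oracle interface ex_oracle and the fixed sample size m are assumptions made for illustration (the analysis takes m polynomial in \(n^{d}\), \(1/\epsilon\) and \(\log(1/\delta)\)):

\begin{verbatim}
import itertools

def low_degree_algorithm(ex_oracle, n, d, m):
    """Estimate every Fourier coefficient fhat(S) with |S| <= d from m
    labelled examples (x, f(x)), x uniform on {0,1}^n and f(x) in {-1,+1};
    the hypothesis is the sign of the truncated Fourier expansion."""
    sample = [ex_oracle() for _ in range(m)]
    chi = lambda S, x: (-1) ** sum(x[i] for i in S)      # parity character
    coeffs = {}
    for k in range(d + 1):
        for S in itertools.combinations(range(n), k):
            # fhat(S) = E[f(x) chi_S(x)], estimated by the sample average
            coeffs[S] = sum(y * chi(S, x) for x, y in sample) / m
    def h(x):
        val = sum(c * chi(S, x) for S, c in coeffs.items())
        return 1 if val >= 0 else -1
    return h
\end{verbatim}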
Learning some R under U
Polysize DNF formulae learnable in quasi-polynomial time under U.
Depth-d circuits of size
Based on total influence
f learnable in
So, monotone f learnable in
Hardness of ae learning of PARITY(k)
This is equivalent to decoding a high-rate random linear code C from a low number of errors, a long-standing problem believed to be hard. Also equivalent to learning PARITY with noise in the simpler case where the noise is random and independent.
If you can decode
If you can learn PARITY(k) using algorithm
By properly ae learning parity you can decode a random linear code: given G and y, construct H and yH, and run the learner.
UPAC with mq
Learning PARITY(k) using namq
Let H be the check matrix of a code C; the corrupted word is y = c + e for some codeword c in C and error vector e.
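The standard identity behind this (a reminder, not from the line above):
\[
yH^{\top} = (c + e)H^{\top} = eH^{\top}, \qquad \text{since } cH^{\top} = 0 \text{ for every codeword } c \in C,
\]
so the syndrome depends only on the error vector.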
Sparse approach
For t-sparse f; or
Thus, using the K-M algorithm, f is learnable to accuracy
Weak parity learning
Given f with some t-heavy Fourier coefficient and an MQ oracle, find a vector z whose coefficient is t/2-heavy.
Can also view this as learning parities in agnostic setting. See agnostic learning section.
ae-namq algorithm: Like a sieve
Take an ae-namq parity learning algorithm L.
Run
Can also be adapted to find a t/2-heavy component of a randomized function.
K-M algorithm to find big components using membership queries
Let
(Kushilevitz, Mansour). Input t; output all t-heavy coefficients.
Leaves of the tree are individual coefficients
To find
Operation count: poly(t, n).
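A minimal sketch of the K-M tree search; the MQ interface f (taking a 0/1 tuple, returning \(\pm 1\)), the fixed sample count, and reading "t-heavy" as a threshold on \(\hat{f}(S)^{2}\) are assumptions made for illustration:

\begin{verbatim}
import random

def chi(alpha, y):
    """Parity character chi_alpha(y) = (-1)^{<alpha, y>}."""
    return -1 if sum(a & b for a, b in zip(alpha, y)) % 2 else 1

def weight_estimate(f, n, alpha, samples=2000):
    """Estimate W(alpha) = sum over suffixes beta of fhat(alpha.beta)^2,
    using W(alpha) = E_{x,y1,y2}[f(y1 x) f(y2 x) chi_alpha(y1) chi_alpha(y2)]."""
    k, total = len(alpha), 0.0
    for _ in range(samples):
        x = tuple(random.randint(0, 1) for _ in range(n - k))
        y1 = tuple(random.randint(0, 1) for _ in range(k))
        y2 = tuple(random.randint(0, 1) for _ in range(k))
        total += f(y1 + x) * f(y2 + x) * chi(alpha, y1) * chi(alpha, y2)
    return total / samples

def km_heavy(f, n, theta):
    """Return all S (as 0/1 tuples) with estimated fhat(S)^2 >= theta."""
    found = []
    def grow(alpha):
        if weight_estimate(f, n, alpha) < theta / 2:   # slack for sampling error
            return                                     # prune this subtree
        if len(alpha) == n:
            found.append(alpha)                        # leaf = one coefficient
        else:
            grow(alpha + (0,))
            grow(alpha + (1,))
    grow(())
    return found
\end{verbatim}

By Parseval, the total weight across each level is at most 1, so only O(1/theta) prefixes survive per level.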
Results
Decision trees with t leaves are approximable by \(\mathrm{poly}(t/\epsilon)\)-sparse functions (their Fourier L1 norm is at most t); so learnable in polynomial time with membership queries via K-M (see the sketch of the argument below).
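A sketch of the usual argument, with the standard parameters (not spelled out in the line above): a tree with t leaves satisfies \(\sum_S |\hat{f}(S)| \le t\), hence
\[
\sum_{S:\, |\hat{f}(S)| < \epsilon/t} \hat{f}(S)^{2} \;\le\; \frac{\epsilon}{t}\sum_{S}|\hat{f}(S)| \;\le\; \epsilon,
\]
and at most \(t^{2}/\epsilon\) coefficients can have \(|\hat{f}(S)| \ge \epsilon/t\); keeping only those gives a \((t^{2}/\epsilon)\)-sparse \(\epsilon\)-approximation, which K-M finds with membership queries.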
Learning DNF
Aka Harmonic sieve. DNFs have a weakly correlated parity under any distribution (correlation \(\Omega(1/s)\) for an s-term DNF).
Then boost using a p-smooth boosting algorithm.
Boosting weak efficient PAC algorithm L
Practical applications
Frequently used: boost decision trees.
Boost confidence
Run the learner \(O(\log(1/\delta))\) times; test each returned hypothesis on a fresh sample; output the one with lowest empirical error (see the sketch below).
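A minimal sketch of this run-and-validate scheme; the learner interface learner(ex_oracle, eps) and the loose Chernoff constant are assumptions made for illustration:

\begin{verbatim}
import math

def boost_confidence(learner, ex_oracle, eps, delta):
    """Turn a learner that returns an eps/2-good hypothesis only with
    constant probability into one that is eps-good with probability
    >= 1 - delta: run it O(log 1/delta) times, then keep the candidate
    with the lowest error on a fresh validation sample."""
    k = max(1, math.ceil(math.log2(2.0 / delta)))
    candidates = [learner(ex_oracle, eps / 2) for _ in range(k)]
    m = math.ceil((32.0 / eps) * math.log(2.0 * k / delta))  # loose constant
    val = [ex_oracle() for _ in range(m)]
    emp_err = lambda h: sum(1 for x, y in val if h(x) != y) / len(val)
    return min(candidates, key=emp_err)
\end{verbatim}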
Boosting accuracy
So you are using the guarantee that polynomial time
General idea behind all the techniques: put more weight on the points x on which the weak hypothesis h makes mistakes; rerun the weak learner on the reweighted distribution.
Lower bound
Number of iterations to get error \(\epsilon\) from a \(\gamma\)-advantage weak learner: \(\Omega(\log(1/\epsilon)/\gamma^{2})\).
Properties of booster
Noise tolerance. Smoothness: how far does the artificially generated distribution deviate from the original? Affects speed.
Boost accuracy by filtering
Reweight points using online algorithm.
Majority tree of Hypotheses Booster
(Schapire). Error b boosted to \(3b^{2} - 2b^{3}\).
Recursive accuracy boosting
For target accuracy b and distribution D: Maybe invoke self thrice for accuracy
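The error recursion behind the three-way majority (the standard form; the recursion depth is omitted here):
\[
g(\beta) = 3\beta^{2} - 2\beta^{3} \;<\; \beta \quad \text{for } 0 < \beta < \tfrac{1}{2},
\]
so a majority of three sub-hypotheses, each with error \(\beta\) on its suitably filtered distribution, has error at most \(g(\beta)\); iterating drives the error below any target.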
Boost accuracy by sampling
Reweight points using offline algorithm.
Get a sample S; boost accuracy until h is consistent with S; bound the size of h; bound the error via Occam's razor.
Adaptive booster (AdaBoost)
Get sample S; use
Final error \(\le \prod_t 2\sqrt{\epsilon_t(1-\epsilon_t)} \le e^{-2\sum_t \gamma_t^{2}}\), where \(\gamma_t = 1/2 - \epsilon_t\).
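A minimal AdaBoost sketch; the weak-learner interface, which takes weighted examples, is an assumption made for illustration:

\begin{verbatim}
import math

def adaboost(examples, weak_learner, rounds):
    """examples: list of (x, y) with y in {-1, +1}.
    weak_learner(weighted): returns a hypothesis h(x) -> {-1, +1}, trained
    on a list of (x, y, weight) triples (hypothetical interface)."""
    m = len(examples)
    D = [1.0 / m] * m                                  # distribution over S
    hyps, alphas = [], []
    for _ in range(rounds):
        h = weak_learner([(x, y, w) for (x, y), w in zip(examples, D)])
        eps = sum(w for (x, y), w in zip(examples, D) if h(x) != y)
        eps = min(max(eps, 1e-10), 1 - 1e-10)          # guard the logarithm
        alpha = 0.5 * math.log((1 - eps) / eps)        # weight of this round
        hyps.append(h)
        alphas.append(alpha)
        # put more weight on mistakes, less on correctly classified points
        D = [w * math.exp(-alpha * y * h(x)) for (x, y), w in zip(examples, D)]
        Z = sum(D)
        D = [w / Z for w in D]
    return lambda x: 1 if sum(a * h(x) for a, h in zip(alphas, hyps)) >= 0 else -1
\end{verbatim}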
Contrast with maj tree booster
Gives a simple hypothesis; takes advantage of varying error rates of the weak hypotheses. \why
Noise tolerance properties
Sensitive to noise even if
Boosting by branching programs
h = Branching program: DAG of hypotheses
Assume that both
At node (j,k), k-1 hypotheses so far have said h(x)=1 and j-k+1 have said h(x)=0. So, in the final level T, the first T/2 leaves are labelled 0 and the rest are labelled 1.
Take x with c(x)=1:
Show
Resistant to noise if
Consequences
PAC algorithm memory bound:
Hardness of PAC learning
General techniques
Specify a distribution where learning is hard.
Any C which includes functions that can be efficiently used to compute iterated products [cryptography ref] is inherently unpredictable.
PAC-reduce a C’ known to be hard to C.
Show that the VCD is superpolynomial.
Reduce it to a hard problem in another area: eg: decoding random linear codes.
If RP \(\neq\) NP
By contradiction: take an NP-complete language A; efficiently reduce an input a to a labelled sample set
Hardness results for proper learning
Usually the
Cryptographic hardness results for improper learning
Aka inherent tractable-unpredictability. Learning is hard despite polynomially many examples being sufficient and concept evaluation being easy.
C which include concepts that can compute discrete cube roots
Take n-bit N’s. Make the discrete cube root hardness assumption [see cryptography ref]. Consider the uniform distribution over
By PAC reduction, anything which can compute iterated products is hard to learn.
C which include concepts that compute the i-th bit of BBS pseudorandom generator outputs
Distinguisher D
Input a random string; use the uniform distribution over b; use
By PAC reduction, anything which can compute iterated products is hard to learn.
Classes which include RSA decryptors
Inherently tractably-unpredictable classes
Polynomial-sized Boolean circuits of depth log n and
Also, neural networks of constant depth with Boolean weights (Reif). By PAC reduction from poly-sized Boolean formulae, log-space Turing machines and DFAs are inherently unpredictable (but DFAs are learnable with membership queries).
Intersections of halfspaces. \why Agnostically learning single halfspace. \why
\oprob DNF formulae (can be learned under U if RP = NP): hard to learn? Seem too weak to encode pseudorandom generators and many cryptographic primitives; if you unconditionally show DNF hard to learn under U, you show