Aka supervised learning. There are a variety of prediction problems depending on the combination of problem components described below.
Core problem
Input and response variables
A label/ target/ response/ dependent variable \(L\) depends on some predictor/ input/ independent variable \(X\) (a set of features/ covariates).
Range of X and L
The input space is \(D\)-dimensional.
\(ran(L)\) may be a subset of a vector space. It may be continuous or discrete.
Labeling rule sought
The agent/ decision rule produced by the learning algorithm must assign a label \(L\) to some unlabeled test point(s) \(X\), after seeing some observations/ examples/ training points \(S\). As for decision theory in general, such a labeling rule may be randomized or deterministic.
Action space
\(ran(L)\) constitutes the action space in a decision theoretic view of the problem. The action space may be expanded to include stating indecision about a label (a reject option).
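A minimal Python sketch of such a labeling rule over the extended action space; the function name, the probability input, and the threshold are illustrative assumptions, not any particular library's API:

```python
REJECT = "reject"  # the extra "indecision" action added to ran(L)

def labeling_rule(probs: dict, threshold: float = 0.8):
    """Return the most probable label, or REJECT when no label is
    confident enough. `probs` maps each candidate label to its
    (assumed given) probability."""
    label, p = max(probs.items(), key=lambda kv: kv[1])
    return label if p >= threshold else REJECT

print(labeling_rule({"cat": 0.9, "dog": 0.1}))    # cat
print(labeling_rule({"cat": 0.55, "dog": 0.45}))  # reject
```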
Actual phenomenon
Randomized function
In general, the labeling process can be seen as a randomized function \(c: ran(X) \to \) the set of RV’s over \(ran(L)\).
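A minimal sketch of such a randomized labeling process, representing \(c(x)\) as a sampler from the RV over \(ran(L) = \set{0, 1}\) associated with \(x\); the logistic form of the conditional probability is an arbitrary assumption:

```python
import math
import random

def c(x: float) -> int:
    """Randomized labeling function: draws from the RV over {0, 1}
    associated with input x. P(L = 1 | X = x) is (arbitrarily) taken
    to be logistic in x."""
    p1 = 1.0 / (1.0 + math.exp(-x))
    return 1 if random.random() < p1 else 0

# Repeated calls at the same x may return different labels.
print([c(0.3) for _ in range(5)])
```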
Volatility in form
In some problems, the randomized labeling process \(c\) changes with previous labelings of the labeling process (and therefore with observations \(S\)). Eg: predicting the position of a plane 5 seconds in the future, given its positions in the past few seconds.
In many other problems, the labeling process \(c\) remains independent of observations.
Deterministic labeling function
For simple phenomena, a deterministic function \(c\) suffices to relate \(X\) and \(L\), so that \(L = c(X)\).
If \(ran(L)\) is discrete, \(c\) is a discrete function, aka discriminant function: \(c: ran(X) \to \set{C_{k}}\).
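Eg: a toy deterministic discriminant function over a one-dimensional input (the threshold is an arbitrary assumption):

```python
def c(x: float) -> str:
    """Discriminant function c: ran(X) -> {C_1, C_2}; the threshold
    at 0 is purely illustrative."""
    return "C_1" if x < 0.0 else "C_2"

assert c(-1.5) == "C_1" and c(2.0) == "C_2"  # same x, same label, always
```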
Features
The labeling function can often be expressed using a feature mapping function \(\ftr(X)\). The various dimensions of \(ran(\ftr)\) are called the features of the input.
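A sketch of this decomposition, assuming a polynomial feature map \(\ftr\) and a labeling function linear in the features (weights are arbitrary):

```python
def ftr(x: float) -> tuple:
    """Feature map: each coordinate of the output is one feature of x.
    A polynomial map is just one illustrative choice."""
    return (1.0, x, x * x)

def f(x: float) -> float:
    """Labeling function expressed via the feature map: a linear
    combination of the features, with arbitrary weights."""
    w = (0.5, -1.0, 2.0)
    return sum(wi * fi for wi, fi in zip(w, ftr(x)))

print(f(2.0))  # 0.5 - 2.0 + 8.0 = 6.5
```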
General noise model
Using a randomized noise function
Usually, the following model can be used to describe the phenomenon: \(Y = f(X), L = g(Y)\), where \(f\) is a deterministic labeling function and \(g\) is a randomized function, called the noise function, which maps \(ran(f)\) to the random variable \(L\).
\(g\) is usually considered to be symmetric around its expectation.
Using a noise variable
Dependence of \(L\) on \(X\) can be written in terms of a deterministic noise application function \(h\) and a random noise variable \(N\): \(L = h(f(X), N)\), where \(f\) is a deterministic labeling function.
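A sketch of this formulation for binary labels, using a label-flip noise variable; the particular \(f\), \(h\), and flip rate are illustrative assumptions:

```python
import random

def f(x: float) -> int:
    """Deterministic labeling function into {0, 1} (arbitrary rule)."""
    return 1 if x > 0 else 0

def h(y: int, n: int) -> int:
    """Deterministic noise application function: n = 1 flips the clean label."""
    return y ^ n

def sample_L(x: float, flip_rate: float = 0.05) -> int:
    n = 1 if random.random() < flip_rate else 0  # noise variable N ~ Bernoulli(flip_rate)
    return h(f(x), n)
```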
Noise in case of vector labels
Let \(L = h(f(X), N)\) describe the dependence of \(L\) on \(X\), as above. If \(ran(L)\) is part of a vector space, \(h\) can often be expressed arithmetically.
Additive noise
An additive noise application model is common: \(L = f(X) + N\), where \(N\) usually has a symmetric (often normal) distribution centered around \(0\).
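A sketch of data generation under this additive model; the particular \(f\) and \(\sigma\) are arbitrary:

```python
import random

def f(x: float) -> float:
    return 3.0 * x - 2.0  # arbitrary deterministic labeling function

def sample_L(x: float, sigma: float = 0.5) -> float:
    return f(x) + random.gauss(0.0, sigma)  # N ~ Normal(0, sigma^2), centered at 0

S = [(x, sample_L(x)) for x in (0.0, 0.5, 1.0)]  # labeled examples (X_i, L_i)
```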
Multiplicative noise
Multiplicative noise models of the form \(h(Y, N) = NY\) are also interesting. In this case, \(N\) is centered at \(1\).
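And a sketch of the multiplicative counterpart, with \(N\) assumed to be normally distributed around \(1\), so that the expected label given \(X\) remains \(f(X)\):

```python
import random

def f(x: float) -> float:
    return 3.0 * x - 2.0  # same arbitrary labeling function as above

def sample_L(x: float, sigma: float = 0.1) -> float:
    n = random.gauss(1.0, sigma)  # N centered at 1, so E[L | X] = f(X)
    return n * f(x)
```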
Example/ training points
The general properties/ peculiarities of a sample (eg: correlatedness, completeness, online vs offline learning) are considered elsewhere; here we are concerned with peculiarities of samples specific to the prediction problem.
Labeled
Given \(n\) example points \(S = \set{(X_{i}, L_{i})}\). When such labeled examples are provided as part of the problem, the problem is called supervised learning.
Unlabeled
It is possible that we are additionally given a set \(U\) of unlabeled points. In such a case, the problem is called ‘semi-supervised learning’. The reason may be that data points are easy to obtain but expensive to label; or maybe the available labels are noisy.
Alternative labels
For some data points belonging to the input space \(ran(W)\), labels \(K_i\) may be provided. So, examples are pairs \((W_i, K_i)\). Let the underlying labeling functions be \(f_K: ran(W) \to\) RV’s with range \(ran(K)\) and \(f_L: ran(X) \to \) RV’s with range \(ran(L)\).
When the two input spaces and labeling functions are related, this additional data helps in predicting \(L\). Aka transfer learning problem. Eg: Such examples may help deduce relevant features.
In case of cross domain learning, \(ran(K) = ran(L)\), but possibly \(ran(W) \neq ran(X)\). Eg: search query result relevance identification.
In case of cross category learning, \(ran(K) \neq ran(L)\).
In some applications, both the label range and the input space may be the same, but the classification functions may still be different. Eg: Netflix movie ratings by various people.
Eg application: a robot learns to stand on new legs faster by reusing lessons learned while learning to stand on its old legs.
Distribution on test points
This is an essential factor in calculating the risk of labeling rules.
This ‘test distribution’ is usually close to the training data distribution.
Transduction vs induction
If the test points are known during training (eg: in semi-supervised learning), the labeling problem is called transduction.
Otherwise, the labeling problem is called one of induction. This is a harder problem.