06 Continuous response variables' prediction

Aka regression.

For an overview, see the Statistics survey. Here one models a (set of) response random variables Y in terms of input variables X.

Data preparation and assumptions

Scaling, centering and the addition of bias variables are assumed below. These, along with their motivation, are described in the statistics survey.
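
A minimal numpy sketch of this preparation (the toy matrix and the choice to standardize each column to unit variance are assumptions for illustration):

```python
import numpy as np

# Toy design matrix: rows are data points, columns are input variables (made up for illustration).
X = np.array([[2.0, 100.0],
              [3.0, 150.0],
              [5.0, 120.0]])

# Centering and scaling: each column gets mean 0 and standard deviation 1.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Addition of the constant bias variable X_0 = 1 as the first column.
X_prepared = np.hstack([np.ones((X.shape[0], 1)), X_std])
print(X_prepared)
```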

Generalized linear model

Linear models

Here, we suppose that \(Y|X = XW + N\), where N is a 0-mean noise RV. Then, \(E[Y|X] = XW\), which is linear in the parameters W.

Corresponding to the constant variable \(X_0 = 1\), we have the bias parameters \(W_{0,:}\).
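
A sketch of fitting such a model by least squares (the simulated data and the squared-error criterion, which is not specified above, are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data following Y = XW + N, with N a 0-mean Gaussian noise RV (assumed for illustration).
n = 200
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])  # first column is the constant variable X_0 = 1
W_true = np.array([1.0, 2.0, -3.0])
y = X @ W_true + rng.normal(scale=0.1, size=n)

# Least-squares estimate of the parameters W.
W_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(W_hat)  # close to W_true
```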

Generalization

One can extend the family of linear models so that \(E_{Y|X}[Y] = g^{-1}(XW)\) and \(var[Y] = f(E_{Y|X}[Y])\). Note that the variance is then a function of the predicted value.

A distribution from the exponential family must be used.
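
For example (standard instances, not stated above): for Poisson regression one takes \(g = \log\), so \(E_{Y|X}[Y] = e^{XW}\) and \(var[Y] = E_{Y|X}[Y]\); for the logistic model one takes g to be the logit function, so \(E_{Y|X}[Y] = (1 + e^{-XW})^{-1}\) and \(var[Y] = E_{Y|X}[Y](1 - E_{Y|X}[Y])\).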

Log linear model

Aka Poisson regression. \(\log(E[Y]) = XW\).
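
A minimal sketch of fitting this model by gradient ascent on the Poisson log-likelihood (the simulated data, step size and iteration count are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated count data with log(E[Y]) = XW (assumed for illustration).
n = 500
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 1))])
W_true = np.array([0.5, 1.2])
y = rng.poisson(np.exp(X @ W_true))

# Gradient of the Poisson log-likelihood w.r.t. W is X^T (y - exp(XW)).
W = np.zeros(2)
for _ in range(2000):
    mu = np.exp(X @ W)
    W += 0.1 * X.T @ (y - mu) / n
print(W)  # close to W_true
```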

Logistic model

Aka logit model, logistic regression. A generalized linear model. See ‘discriminative models of response’ section.

Perceptron: step function

Here \(E_{Y|X}[Y] = I[XW > 0]\).
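
A minimal sketch of this prediction rule (the weights and inputs are made up for illustration):

```python
import numpy as np

def perceptron_predict(X, W):
    # E[Y|X] = I[XW > 0]
    return (X @ W > 0).astype(int)

# Made-up weights; the first component multiplies the constant variable X_0 = 1.
W = np.array([-1.0, 2.0])
X = np.array([[1.0, 0.2],   # activation -0.6 -> predict 0
              [1.0, 0.8]])  # activation  0.6 -> predict 1
print(perceptron_predict(X, W))  # [0 1]
```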

Multi-layer generalized linear model

Aka Artificial Neural Network, multi-layer perceptron (a misnomer given that the activation function described below is not the non-differentiable step function).

Model

Suppose one wants to predict Y=y using the input \(X^{(0)} = x^{(0)}\) (aka the input layer). The model \(Y = h(X^{(0)})\) is hierarchical.

One can obtain layer upon layer of intermediary random variables \(X^{(j)} = \{X_i^{(j)}\}\), where \(X_i^{(j)} = f(\langle w_i^{(j)}, X^{(j-1)} \rangle + w_{i,0}^{(j)})\). Suppose one has k such intermediary layers. One finally models \(X_j^{(k+1)} = h(\langle w_j^{(k+1)}, X^{(k)} \rangle)\) (aka the output layer).
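
A minimal sketch of this forward computation for \(k = 2\) intermediary layers (the dimensions, the random weights and the use of tanh as f are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(W, b, x, f):
    # X_i^{(j)} = f(<w_i^{(j)}, X^{(j-1)}> + w_{i,0}^{(j)}), computed for all i at once.
    return f(W @ x + b)

# Made-up dimensions: 3 inputs -> 4 hidden -> 4 hidden -> 1 output.
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 4)), np.zeros(4)
W3, b3 = rng.normal(size=(1, 4)), np.zeros(1)

x0 = np.array([0.5, -1.0, 2.0])      # input layer X^(0)
x1 = layer(W1, b1, x0, np.tanh)      # intermediary layer X^(1)
x2 = layer(W2, b2, x1, np.tanh)      # intermediary layer X^(2)
y = layer(W3, b3, x2, lambda a: a)   # output layer X^(3); h is the identity here (regression)
print(y)
```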

Component names

The intermediary layers are called hidden layers. Neurons in the hidden/'skip' layers are called hidden units. Neurons in the output layer are called output units.

\(a_i^{(j)} = \langle w_i^{(j)}, X^{(j-1)} \rangle + w_{i,0}^{(j)}\) is called the activation.

Activation function

f is usually a non-linear function - the logistic sigmoid function (a smooth version of the step function, with range (0, 1)) and the tanh function (range (-1, 1)) are commonly used when a classification problem is being solved by relaxation to a regression problem. In the case of regression problems, or in the case of 'skip' layer variables, the final f is just the identity function - or a sigmoid function, which approximates it near 0.
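
For reference, a sketch of these activation functions (only the standard definitions are assumed):

```python
import numpy as np

def logistic(a):
    # Logistic sigmoid: smooth, range (0, 1).
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    # Hyperbolic tangent: smooth, range (-1, 1); tanh(a) = 2 * logistic(2a) - 1.
    return np.tanh(a)

def identity(a):
    # Used for the output f in regression problems.
    return a

a = np.linspace(-3.0, 3.0, 7)
print(logistic(a))
print(tanh(a))
print(identity(a))
```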

Visualization as a network

There are the input layer, the hidden layers and the output layer. Directed arrows go from one layer to the next. This is a Directed Graphical Model, except that the intermediary dependencies are deterministic, not stochastic.

Nomenclature

Depending on preference, a model with K layers of non-input (intermediary + output) variables is called a K+1 or K layer neural network. We prefer the latter.

2 layer networks are most common.

Connection to other models

\tbc

Model training

One can write \(Y = h_w(X)\), where h is differentiable in the parameters w, yet not convex in them. One can fit the model parameters to training data \(((x_i, y_i))\) by minimizing the (possibly regularized) empirical loss.
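
For concreteness, assuming a squared-error loss and an L2 regularizer (neither of which is specified above), the training objective takes the form \(\hat{w} = \arg\min_w \sum_i (y_i - h_w(x_i))^2 + \lambda \|w\|_2^2\).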

Gradient finding

Given an error function \(E(y)\) for a given data point \((x, t)\), various optimization techniques require one to find \(\nabla_w E(y)\). This gradient can be found efficiently using the error back-propagation algorithm.

The idea is that the parameter \(w_{k,j}^{(f)}\) only affects \(E(y)\) through the output \(X_k^{(f)}\), so one can apply the chain rule for partial derivatives.

For the output unit, \(\frac{dE(X_1^{(t)})}{dw_{1,j}^{(t)}} = \frac{dE(X_1^{(t)})}{dX_1^{(t)}} f'(a_1^{(t)}) X_j^{(t-1)}\). Denote \(d_1^{(t)} := \frac{dE(X_1^{(t)})}{dX_1^{(t)}} f'(a_1^{(t)})\) - the quantity multiplied with \(X_j^{(t-1)}\) in the expression. This is aka the 'error'.

Assume that \(\frac{dE(X_1^{(t)})}{dw_{i,j}^{(f)}} = d_i^{(f)} X_j^{(f-1)}\) holds for neurons in the levels \(f, \dots, t\). We can see that a similar expression holds for level \(f-1\) too.

For symbol-manipulation convenience, denote the ith input to the kth neuron in layer f by \(Z_{k,i}^{(f)} = X_i^{(f-1)}\). Using the chain rule for partial derivatives:

\(\frac{dE(X_1^{(t)})}{dw_{i,j}^{(f-1)}} = \sum_k \frac{dE(X_1^{(t)})}{dZ_{k,i}^{(f)}} \frac{dX_i^{(f-1)}}{dw_{i,j}^{(f-1)}} = \sum_k d_k^{(f)} w_{k,i}^{(f)} f'(a_i^{(f-1)}) X_j^{(f-2)}\).

Setting \(d_i^{(f-1)} = f'(a_i^{(f-1)}) \sum_k w_{k,i}^{(f)} d_k^{(f)}\), we see by mathematical induction that \(\frac{dE(X_1^{(t)})}{dw_{i,j}^{(f-1)}} = d_i^{(f-1)} X_j^{(f-2)}\), so the gradient can be calculated for all neurons given the 'error' for the layer ahead.

So, the back-propagation algorithm to find the gradient is: first run the neural network with input x and record all the layer outputs \(X_j^{(f)}\) (the forward pass); then, starting with the output layer and moving backwards, determine the errors \(d_i^{(f)}\) and thence the appropriate gradient components.
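
A minimal numpy sketch of this algorithm for a small network (the tanh hidden layers, identity output unit, squared-error E and the finite-difference check are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(Ws, bs, x):
    # Forward pass: record the activations a^(f) and outputs X^(f) of every layer.
    outs, acts = [x], []
    for i, (W, b) in enumerate(zip(Ws, bs)):
        a = W @ outs[-1] + b
        acts.append(a)
        outs.append(np.tanh(a) if i < len(Ws) - 1 else a)  # tanh hidden layers, identity output
    return outs, acts

def backprop(Ws, bs, x, t):
    # Gradient of E = 0.5 * ||y - t||^2 w.r.t. all weights and biases.
    outs, acts = forward(Ws, bs, x)
    d = outs[-1] - t  # output-layer error: dE/dy * f'(a), with f the identity here
    gWs, gbs = [None] * len(Ws), [None] * len(bs)
    for f in reversed(range(len(Ws))):
        gWs[f] = np.outer(d, outs[f])  # dE/dw_{i,j}^{(f)} = d_i^{(f)} * X_j^{(f-1)}
        gbs[f] = d
        if f > 0:
            # d_i^{(f-1)} = f'(a_i^{(f-1)}) * sum_k w_{k,i}^{(f)} d_k^{(f)}
            d = (Ws[f].T @ d) * (1.0 - np.tanh(acts[f - 1]) ** 2)
    return gWs, gbs

# Made-up network: 3 inputs -> 4 hidden units -> 1 output unit.
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(1, 4))]
bs = [np.zeros(4), np.zeros(1)]
x, t = np.array([0.5, -1.0, 2.0]), np.array([1.0])

gWs, gbs = backprop(Ws, bs, x, t)

# Finite-difference check of one gradient component.
eps = 1e-6
Ws[0][0, 0] += eps
E_plus = 0.5 * np.sum((forward(Ws, bs, x)[0][-1] - t) ** 2)
Ws[0][0, 0] -= 2 * eps
E_minus = 0.5 * np.sum((forward(Ws, bs, x)[0][-1] - t) ** 2)
print(gWs[0][0, 0], (E_plus - E_minus) / (2 * eps))  # should agree closely
```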

Weight initialization

The starting point for (stochastic) gradient descent is chosen as follows: weights can be initialized randomly with mean 0 and standard deviation \(1/\sqrt{m}\), where m is the fan-in of the unit.
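
A sketch of such an initialization (assuming the \(1/\sqrt{m}\) fan-in scaling above):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(fan_in, fan_out):
    # Gaussian weights: mean 0, standard deviation 1/sqrt(fan_in).
    return rng.normal(loc=0.0, scale=1.0 / np.sqrt(fan_in), size=(fan_out, fan_in))

W1 = init_weights(fan_in=3, fan_out=4)  # 3 inputs -> 4 hidden units
W2 = init_weights(fan_in=4, fan_out=1)  # 4 hidden units -> 1 output unit
print(W1.std(), W2.std())
```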

Flexibility

There are theorems which show that a two layer network can approximate any continuous function to arbitrary accuracy - provided a sufficient number of intermediary variables are allowed!

The flexibility of the multi-layer generalized linear model derives from the non-linearity in the activation functions.

Disadvantages

The objective function minimized during training is non-convex.

A large diversity of training examples is required. The learned model, though it may be effective, is not readily usable as a realistic model of the process producing the data.

It can be inefficient in terms of storage space and computational resources required.

The brain, by contrast, solves all these problems because its hardware is tuned to the neural network architecture, and its training examples have sufficient variety.

Deep belief network

Extending the idea of neural networks, adding structure to them and using a sort of L1 regularization to make the network sparse, one gets deep belief networks. These have proved to be very successful in many applications since 2007.

\tbc