Aka regression.
For an overview, see the Statistics survey. Here one models a (set of) response random variables as a function of predictor/ explanatory random variables.
Data preparation and assumptions
Scaling, centering and the addition of bias variables are assumed below. That, along with the motivation, is described in the statistics survey.
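As a minimal sketch of this preparation step (numpy, with made-up toy data; nothing here is prescribed by these notes):

```python
import numpy as np

def prepare(X):
    """Center and scale each column, then append a constant bias column of 1s."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    sigma[sigma == 0] = 1.0                              # leave constant columns unscaled
    Xs = (X - mu) / sigma                                # centered, unit-variance features
    return np.hstack([Xs, np.ones((Xs.shape[0], 1))])    # bias variable

X = np.array([[170.0, 65.0], [180.0, 80.0], [160.0, 55.0]])   # toy data
print(prepare(X))
```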
Generalized linear model
Linear models
Here, we suppose that the expected response is a linear function of the input variables: \(E[Y] = w^\top x\).
Corresponding to the constant variable (identically 1) added during data preparation, one weight acts as the intercept/ bias term.
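A minimal sketch of fitting such a model by least squares, using a design matrix that already carries the constant/ bias column (toy data, illustrative only):

```python
import numpy as np

# Design matrix with a trailing column of 1s (the constant variable).
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
y = np.array([1.1, 2.9, 5.2, 6.8])               # roughly y = 2x + 1 plus noise

w, *_ = np.linalg.lstsq(X, y, rcond=None)        # least-squares weight estimate
print("weights:", w)                             # last entry is the bias weight
print("prediction at x = 4:", np.array([4.0, 1.0]) @ w)
```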
Generalization
One can extend the family of linear models so that the mean of the response, transformed by a link function, is a linear function of the inputs: \(g(E[Y]) = w^\top x\).
A distribution from the exponential family must be used.
Log linear model
Aka Poisson regression. The link function is the log, so \(\log E[Y] = w^\top x\), and the response is a Poisson-distributed count.
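A minimal sketch of fitting this by gradient ascent on the log-likelihood (synthetic data; the step size and iteration count are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(200, 1)), np.ones((200, 1))])   # one feature + bias
true_w = np.array([0.8, 0.5])
y = rng.poisson(np.exp(X @ true_w))        # counts generated with a log link

w = np.zeros(2)
for _ in range(2000):
    mu = np.exp(X @ w)                     # E[Y | x] under the current weights
    w += 1e-3 * X.T @ (y - mu)             # gradient of the Poisson log-likelihood
print("estimated weights:", w)             # should be close to true_w
```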
Logistic model
Aka logit model, logistic regression. A generalized linear model. See ‘discriminative models of response’ section.
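In GLM terms the link is the logit, so \(P(Y=1 \mid x)\) is the logistic sigmoid of \(w^\top x\). A minimal sketch of fitting it by gradient ascent on the Bernoulli log-likelihood (synthetic data; step size and iteration count are arbitrary):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(1)
X = np.hstack([rng.normal(size=(300, 2)), np.ones((300, 1))])   # two features + bias
true_w = np.array([1.5, -2.0, 0.3])
y = (rng.random(300) < sigmoid(X @ true_w)).astype(float)       # 0/1 labels

w = np.zeros(3)
for _ in range(5000):
    p = sigmoid(X @ w)              # P(Y = 1 | x) under the current weights
    w += 1e-2 * X.T @ (y - p)       # gradient of the Bernoulli log-likelihood
print("estimated weights:", w)
```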
Perceptron: step function
Here the activation is the non-differentiable step function: the output is 1 if \(w^\top x \geq 0\) and 0 otherwise.
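A minimal sketch of the perceptron and its classical mistake-driven learning rule, on a linearly separable toy problem (data and learning rate are made up):

```python
import numpy as np

def step(a):
    return np.where(a >= 0, 1, 0)        # non-differentiable step activation

# Toy AND-like data: label 1 only for (1, 1); bias column of 1s appended.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w = np.zeros(3)
for _ in range(20):                      # a few passes suffice for separable data
    for xi, ti in zip(X, y):
        w += 0.1 * (ti - step(xi @ w)) * xi   # update only on mistakes
print("weights:", w, "predictions:", step(X @ w))
```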
Multi-layer generalized linear model
Aka Artificial Neural Network, multi-layer perceptron (a misnomer given that the activation function described below is not the non-differentiable step function).
Model
Suppose one wants to predict a (set of) response variables from a set of input variables.
One can obtain layer upon layer of intermediary random variables, each layer computed by applying an activation function to linear combinations (generalized linear models) of the previous layer's variables; the final layer yields the response.
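A minimal sketch of this layered construction: a forward pass through a network with one hidden layer of tanh units and linear output units (the sizes and the tanh choice are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

def forward(x, W1, W2):
    """Input -> hidden layer -> output; each layer is a generalized linear model."""
    a1 = W1 @ x               # pre-activations of the hidden units
    z1 = np.tanh(a1)          # hidden-unit outputs (nonlinear activation)
    return W2 @ z1            # linear output units (e.g. for regression)

x = rng.normal(size=4)                 # 4 input variables
W1 = 0.5 * rng.normal(size=(5, 4))     # weights into 5 hidden units
W2 = 0.5 * rng.normal(size=(2, 5))     # weights into 2 output units
print(forward(x, W1, W2))
```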
Component names
The intermediary layers are called hidden layers. Neurons in the hidden/ ‘skip’ layers are called hidden units. Neurons in the output layer are called output units.
Activation function
Each unit applies a nonlinear activation function \(f\) to its weighted input; unlike the perceptron's step function, differentiable choices (such as the logistic sigmoid or tanh) are used, which makes gradient-based training possible.
Visualization as a network
There is the input layer, hidden layers and the output layer. Directed arrows go from one layer to the next. This is a Directed Graphical Model except that the intermediary dependencies are deterministic, not stochastic.
Nomenclature
Depending on preference, a model with a single hidden layer is called a two layer network (counting layers of adaptive weights) or a three layer network (counting layers of units).
Two layer networks are the most common.
Connection to other models
\tbc
Model training
One can write the total error as a sum over training cases, \(E = \sum_t E(X_1^{(t)})\), and minimize it over the weights by (stochastic) gradient descent.
Gradient finding
Given an error function \(E\), one needs its gradient with respect to every weight in order to run gradient descent; back propagation computes this efficiently.
The idea is that the parameter \(w_{i,j}^{(f-1)}\), the weight from unit \(i\) in layer \(f-1\) to unit \(j\) in layer \(f\), affects the error only through the pre-activation \(a_j^{(f)} = \sum_i w_{i,j}^{(f-1)} z_i^{(f-1)}\), where \(z_i^{(f-1)} = f(a_i^{(f-1)})\) is the output of unit \(i\).
For an output unit, \(\frac{\partial E(X_1^{(t)})}{\partial a_j}\) can be computed directly by differentiating the error function with respect to the network outputs.
Assume that these derivatives are known for all units in layer \(f\).
For symbol manipulation convenience, set \(d_j^{(f)} = \frac{\partial E(X_1^{(t)})}{\partial a_j^{(f)}}\), the 'error' of unit \(j\) in layer \(f\); then by the chain rule \(\frac{\partial E(X_1^{(t)})}{\partial w_{i,j}^{(f-1)}} = d_j^{(f)} z_i^{(f-1)}\).
Setting \(d_i^{(f-1)} = f'(a_i^{(f-1)}) \sum_k w_{i,k}^{(f-1)} d_k^{(f)}\), we see from mathematical induction that \(\frac{\partial E(X_1^{(t)})}{\partial w_{i,j}^{(f-1)}}\) can be calculated for all neurons given the 'errors' of the layer ahead.
So, the back propagation algorithm to find the gradient is: first run the neural network forward with input \(X_1^{(t)}\), storing all pre-activations and unit outputs; then compute \(d_j\) for the output units directly, propagate the 'errors' backwards layer by layer with the recursion above, and read off each gradient as \(d_j^{(f)} z_i^{(f-1)}\).
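A minimal sketch of back propagation for the two layer case, with tanh hidden units, linear output units and squared error (these concrete choices are assumptions for illustration); the result is checked against a finite-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(3)

def forward(x, W1, W2):
    z1 = np.tanh(W1 @ x)                 # hidden-unit outputs
    return z1, W2 @ z1                   # and linear network outputs

def backprop(x, t, W1, W2):
    """Gradients of E = 0.5 * ||y - t||^2 for one training case."""
    z1, y = forward(x, W1, W2)           # forward pass, storing activations
    d2 = y - t                           # 'errors' dE/da at the output units
    d1 = (1.0 - z1**2) * (W2.T @ d2)     # recursion: f'(a_i) * sum_k w_ik d_k
    return np.outer(d1, x), np.outer(d2, z1)    # dE/dW1, dE/dW2

x, t = rng.normal(size=4), rng.normal(size=2)
W1, W2 = rng.normal(size=(5, 4)), rng.normal(size=(2, 5))
gW1, gW2 = backprop(x, t, W1, W2)

# Finite-difference check of a single weight of W1.
def E(W1_):
    _, y = forward(x, W1_, W2)
    return 0.5 * np.sum((y - t) ** 2)
W1p = W1.copy()
W1p[0, 0] += 1e-6
print(gW1[0, 0], (E(W1p) - E(W1)) / 1e-6)    # the two numbers should agree closely
```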
Weight initialization
The starting point for (stochastic) gradient descent is chosen as follows: weights can be initialized randomly with mean 0 and a small standard deviation (a common choice is of order \(1/\sqrt{m}\), where \(m\) is the number of inputs to the unit), so that units start near the linear regime of their activation function.
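For instance, a minimal sketch of such an initialization, with the standard deviation scaled by the fan-in of each unit (the \(1/\sqrt{m}\) scaling is a common heuristic stated here as an assumption):

```python
import numpy as np

rng = np.random.default_rng(4)

def init_weights(n_out, n_in):
    # Zero-mean Gaussian weights with std ~ 1/sqrt(fan-in), so pre-activations
    # are of order 1 and units start near the linear part of the activation.
    return rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in))

W1 = init_weights(5, 4)   # hidden-layer weights
W2 = init_weights(2, 5)   # output-layer weights
```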
Flexibility
There are theorems which show that a two layer network can approximate any continuous function to arbitrary accuracy - provided a sufficient number of intermediary variables are allowed!
The flexibility of the multi-layer generalized linear model derives from the non-linearity in the activation functions.
Disadvantages
The objective function minimized during training is non-convex.
A large and diverse set of training examples is required. The learned model is not readily usable as a realistic model of the process producing the data, though it may predict effectively.
It can be inefficient in terms of the storage space and computational resources required.
The brain, by contrast, solves all these problems: its hardware is tuned to the neural network architecture, and its training examples have sufficient variety.
Deep belief network
Extending the idea of neural networks, adding structure to them, and using a sort of L1 regularization to make the network sparse, one obtains deep belief networks. These have proved very successful in many applications since 2007.
\tbc