Reinforcement learning setting
Interaction with the environment
The agent can be in one of a set of states $S$, and in each state it chooses an action from a set of actions $A$.
The effect of an action is given by transition probability distributions associated with each (state, action) pair.
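As a minimal sketch (the state and action names are illustrative only, not from the original notes), such a transition model can be represented as a map from (state, action) pairs to distributions over successor states:
\begin{verbatim}
# Transition model: (state, action) -> distribution over successor states.
# The probabilities for each (state, action) pair must sum to 1.
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.5, "s1": 0.5},
}

def next_state_distribution(state, action):
    """Return the probability distribution over successor states."""
    return P[(state, action)]
\end{verbatim}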
Policy learning
The agent wants to find the optimal policy.
Sub-problems
Observing only occasional rewards, the agent must assign credit to past actions: the credit assignment problem.
Exploration vs exploitation trade-off: one cannot find high-reward states by always greedily exploiting the best known state.
Reward learning
Aka the inverse reinforcement learning problem.
Sometimes, given the description of the environment (states, actions, transition probabilities) and the observed behavior of an agent (eg: an expert's policy or trajectories), one wants to infer the reward function that the agent appears to be optimizing.
Non-triviality criteria
A trivial solution would be a reward function which assigns equal reward to all states: every behavior is then optimal, so one needs non-triviality criteria to rule out such degenerate reward functions.
Motivation
In economics and biology, the precise mechanism that motivates an animal is interesting. Eg: Do waiters who tend to seat customers outside do so in order to attract other customers? \tbc
Reinforcement learning in animals
Stimulus
Conditioning
Eg: Pavlov's dog; the behaviorist Skinner training a dog to jump against a wall in 20 minutes; cats escaping a box to get to fish. 1st-order vs higher-order conditioning.
Acquisition due to reinforcement over successive trials; extinction due to removal of/ change in R(q): visualize with a graph of salivation level vs number of trials. Spaced trials are better for acquisition than massed trials: more time for the animal to make correlations.
Habituation: Plateau in response level.
Extinction burst: when the reward stops, the animal temporarily tries much harder.
Avoidance: the animal avoids certain states/ stimuli after a negative reward, and thereby loses the chance to sample/ explore those states further.
Reinforcement schedules: fixed reward-to-performance ratio, fixed interval (not good: the animal learns the interval), etc.
Conditioning due to different reinforcers (food, water etc.).
Learning by simulation
Suppose you wanted to create an intelligent automaton - eg: a pilot program for a drone aircraft. It is very hard to think of all possible cases - a pilot's knowledge is often intuitive and trained by the environment. On the other hand, it can be simpler to specify, and thus simulate, the environment. Thus, one can train the automaton in the simulated environment.
Markov decision process (MDP)
Abstract problem
We consider the reinforcement learning problem, but with an immediate reward function $R$, known for each state (or (state, action) pair), given as part of the model.
Limited dependence on history
State transitions show the Markov property: the next state depends only on the current state and action, not on the rest of the history.
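In symbols (a sketch, using $s_t$ and $a_t$ for the state and action at time $t$):
\[
P(s_{t+1} = s' \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0) = P(s_{t+1} = s' \mid s_t, a_t).
\]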
Representation, notation
For a diagrammatic representation and a summary of alternative terms/ symbols used, see Wikipedia.
Essentially, one considers the state transition graph of a Markov chain whose states are given by the states $S$ of the MDP.
The extension is that edges are labeled by actions and carry the transition probabilities $P(s' \mid s, a)$ (and rewards) associated with each (state, action) pair.
Policy
A policy is a map $\pi: S \to A$ specifying which action to take in each state.
Long term reward
One can compare policies by somehow aggregating the rewards one would expect over future time steps.
It is logical for the weights corresponding to distant times to be smaller.
Weighted sum: state values
One common way of aggregating reward over an infinite horizon is a weighted sum of expected rewards, with geometrically decreasing weights $\gamma^t$ given by a discount factor $0 \le \gamma < 1$.
This expected reward for applying policy $\pi$ starting from state $s$ is called the value $V^{\pi}(s)$ of the state.
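As a sketch, assuming rewards $R(s, a)$ and transition probabilities $P(s' \mid s, a)$ as above, the value of a state and the recursive (Bellman) equation it satisfies are:
\[
V^{\pi}(s) = E\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, \pi(s_t)) \,\middle|\, s_0 = s\right]
= R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, V^{\pi}(s').
\]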
Best policy
One ideally wants to find a policy which maximizes the value of the current state.
For a finite-state MDP with an infinite horizon where rewards are aggregated using geometrically decreasing weights, one only needs to solve a linear program to find the optimal value vector, from which an optimal policy is obtained by acting greedily.
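A standard form of such a linear program (a sketch; the variables are the state values $V(s)$):
\[
\min_{V} \sum_{s} V(s) \quad \text{subject to} \quad V(s) \ge R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \quad \forall s, a.
\]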
Solution techniques
Besides linear programming, one can use various iterative dynamic programming techniques.
Policy iteration starts with an arbitrary policy vector, then repeatedly evaluates the current policy (computing its value vector) and improves it greedily with respect to those values, until the policy stops changing.
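A minimal NumPy sketch of policy iteration, assuming the MDP is given as arrays P (shape A x S x S) and R (shape S x A); the function name and array shapes are illustrative, not from the original notes:
\begin{verbatim}
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Policy iteration for a finite MDP (illustrative sketch).

    P: array of shape (A, S, S); P[a, s, s2] = probability of moving
       from state s to state s2 when action a is taken.
    R: array of shape (S, A); immediate reward for taking action a in s.
    Returns (policy, values), where policy[s] is the chosen action index.
    """
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)        # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = P[policy, np.arange(n_states), :]  # (S, S)
        R_pi = R[np.arange(n_states), policy]     # (S,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to the new values.
        Q = R.T + gamma * P @ V                   # (A, S)
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
\end{verbatim}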
Dealing with large state spaces
One can deal with large state spaces by collapsing similar states together. Eg: In the case where the state space corresponds to the 3-dimensional coordinates of an aircraft relative to a target, one can use an alternative state space defined instead by the distance to the target.
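A toy sketch of such state aggregation for the aircraft example, assuming the raw state is a 3-dimensional position relative to the target; the bucket size is an arbitrary, illustrative choice:
\begin{verbatim}
import math

def abstract_state(dx, dy, dz, bucket_size=100.0):
    """Collapse a 3-D position relative to the target into a coarse
    distance bucket; many raw states map to the same abstract state."""
    distance = math.sqrt(dx * dx + dy * dy + dz * dz)
    return int(distance // bucket_size)
\end{verbatim}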
One can use Q-learning: there is no need to explicitly specify the transition probabilities $P$ or to enumerate all the states in advance. This is advantageous when the state space is huge.
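A minimal sketch of tabular Q-learning, assuming a hypothetical environment object with a finite action list env.actions, a reset() method returning the start state, and a step(action) method returning (next_state, reward, done); these names are illustrative, not from the original notes:
\begin{verbatim}
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning sketch (model-free: no transition matrix needed)."""
    Q = defaultdict(float)                    # Q[(state, action)], default 0.0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy exploration (cf. exploration vs exploitation).
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # Bootstrap from the best action in the successor state.
            best_next = max(Q[(s_next, act)] for act in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
\end{verbatim}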
Partially observable Markov decision process (POMDP)
Problem setting
As in the case of MDPs, we have a set of states $S$, a set of actions $A$, and transition probabilities associated with each (state, action) pair.
Observations
But the states are not necessarily fully observable. So, we have a set of observations $\Omega$, and each state gives rise to observations according to some probability distribution.
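A common way to formalize this (a sketch; the notation is assumed, not from the original notes) is an observation function giving the probability of perceiving observation $o$ when landing in state $s'$ after taking action $a$:
\[
O(o \mid s', a) = P(o_t = o \mid s_t = s', a_{t-1} = a).
\]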
Simplified rewards
The reward function is simpler than in an MDP (possibly at the expense of a bigger state space): it depends on the state alone, $R: S \to \mathbb{R}$.
Belief states
\tbc