Question
(10%) Suppose you have a STRIPS representation for actions A1 and A2, and you want to define the STRIPS representation for the composite action A1;A2, which means that you do A1 and then do A2.
-
What is the add list for this composite action?
-
What is the delete list?
-
What are the preconditions for this composite action?
-
Give a possible STRIPS representation of the actions Move(here,there) and Pickup(object) and for the composite action Move(here,there);Pickup(object).
Solution
Let AL(A) and DL(A) denote the add and delete lists of action A, respectively, and let PC(A) denote the preconditions of action A.
1: AL(A1;A2) consists of the elements of AL(A2), and those elements of AL(A1) which will not be deleted by A2.
2: DL(A1;A2) consists of the elements of DL(A2), and those elements of DL(A1) which will not be added again by A2.
3: PC(A1;A2) consists of the elements of PC(A1), and those elements of PC(A2) which are not supplied by AL(A1).
4: A possible STRIPS representation follows:
Question
(20%) The towers of Hanoi problem is to move a set of n disks of different sizes from one peg to another, using a third peg for temporary storage. Disks are moved one at a time, and a larger disk cannot rest on a smaller one. (You might be familiar with a recursive algorithm for solving this problem.) Formulate this problem as a STRIPS-style planning problem. You will need to specify the initial state, the goal state, and the set of STRIPS-style operators. Feel free to use variables in operator description if needed.
Solution
Initial state:
Goal state:
The action description (disc, source, and dest are variables):
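One plausible formulation covering the three items above, shown for n = 3 discs D1 (smallest), D2, D3 and pegs Peg1, Peg2, Peg3; the predicates On, Clear, and Smaller are illustrative, and every disc is Smaller than every peg and than every larger disc:

- Initial state: On(D1, D2), On(D2, D3), On(D3, Peg1), Clear(D1), Clear(Peg2), Clear(Peg3), plus the Smaller facts.
- Goal state: On(D1, D2), On(D2, D3), On(D3, Peg3).
- Move(disc, source, dest): preconditions On(disc, source), Clear(disc), Clear(dest), Smaller(disc, dest); add list On(disc, dest), Clear(source); delete list On(disc, source), Clear(dest).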
Question
(20%) Let us consider a version of the milk/banana/drill shopping problem in which money is included, at least in a simple way.
-
Let CC denote a credit card that the agent can use to buy any object. Modify the description of Buy so that the agent has to have its credit card in order to buy anything.
-
Write a Pickup operator that enables the agent to Have an object if it is portable and at the same location as the agent.
-
Assume that the credit card is at home, but Have(CC) is initially false. Construct a partially ordered plan that achieves the goal, showing both ordering constraints and causal links.
-
Explain in detail what happens during the planning process when the agent explores a partial plan in which it leaves home without the card.
Solution
1: The modified Buy operator is described by the following:
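A sketch consistent with the preconditions cited in part 4 below (Sells and At are the usual shopping-problem predicates):

- Buy(x, store): preconditions At(store), Sells(store, x), Have(CC); add list Have(x); delete list empty.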
2: ‘At(y)’ denotes the presence of the agent at y.
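A sketch matching the preconditions listed in part 4 below (whether Pickup also deletes ObjectAt(y, o) is a modelling choice):

- Pickup(o): preconditions Portable(o), ObjectAt(y, o), At(y); add list Have(o); delete list empty (or ObjectAt(y, o)).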
3: The plan, with its causal links and ordering constraints, is shown below.
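A textual sketch of one such plan (one consistent set of steps, causal links, and ordering constraints; the action arguments follow those used in part 4 below):

- Steps: Start, Pickup(CC), Go(Home, HWS), Buy(Drill, HWS), Go(HWS, SM), Buy(Milk, SM), Buy(Bananas, SM), Go(SM, Home), Finish.
- Causal links: Start supplies At(Home) and ObjectAt(CC, Home) to Pickup(CC), and At(Home) to Go(Home, HWS); Pickup(CC) supplies Have(CC) to each Buy step; Go(Home, HWS) supplies At(HWS) to Buy(Drill, HWS); Go(HWS, SM) supplies At(SM) to Buy(Milk, SM) and Buy(Bananas, SM); Go(SM, Home) supplies At(Home) to Finish; the Buy steps supply Have(Drill), Have(Milk), and Have(Bananas) to Finish.
- Ordering constraints: Pickup(CC) before Go(Home, HWS) (Go deletes At(Home)); Buy(Drill, HWS) before Go(HWS, SM); Buy(Milk, SM) and Buy(Bananas, SM) before Go(SM, Home).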

4: Let us consider an agent which implements ‘Plan Space Planning’. For such an agent, plan flaws initially consist of all open goals. The agent iteratively refines a partial plan; at each refinement step, it non-deterministically resolves a flaw. Initially, the partial plan consists of only the ‘Start’ and ‘Finish’ steps.
The question requires us to consider the case in which the agent explores a partial plan where it leaves home without the card. There are many partial plans where this happens. I will use the following partial plan to explain in detail what happens during the planning process when the agent explores such a plan (the argument extends to other such partial plans):
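A sketch of such a partial plan, consistent with the flaws and links discussed below:

- Steps and ordering: Start, Go(Home, HWS), Buy(Drill, HWS), Finish, with Start before Go(Home, HWS) before Buy(Drill, HWS) before Finish.
- Causal links: Start supplies At(Home) to Go(Home, HWS); Go(Home, HWS) supplies At(HWS) to Buy(Drill, HWS); Buy(Drill, HWS) supplies Have(Drill) to Finish.
- Open preconditions (flaws): Have(CC) for Buy(Drill, HWS); Have(Milk), Have(Bananas), and At(Home) for Finish.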

The above is a plausible partial-order plan which can be generated during ‘Plan Space Planning’. We see that the set of flaws, F, consists of the open goals: {Have(CC), Have(Milk), Have(Bananas), At(Home)}. Now, the agent generates the set of resolvers for these flaws: the actions {Pickup(CC), Buy(CC), Pickup(Milk), Buy(Milk), Pickup(Bananas), Buy(Bananas), Go(x, Home)}.
Note: both Pickup(Milk) and Buy(Milk) can satisfy Have(Milk). However, Pickup(Milk) requires Portable(Milk), ObjectAt(x1, Milk), and At(x1), whereas Buy(Milk) requires Have(CC), At(SM), and Sells(SM, Milk). An agent choosing Pickup(Milk) will end up with a partial plan which cannot be refined any further. Hence, let us assume that the agent is designed so that, in its non-deterministic choice of resolver, it prefers the resolver with more satisfied preconditions, and therefore selects Buy(Milk) rather than Pickup(Milk). For similar reasons, the agent will select Pickup(CC) over Buy(CC).
Now, it is possible that, in the next refinement of the partial plan, the agent selects Pickup(CC) to resolve the flaw of not having the credit card. Since the action ‘Start’ provides At(Home) and ObjectAt(CC, Home), which are preconditions of Pickup(CC), the agent adds causal links from Start to Pickup(CC). Similarly, a causal link for Have(CC) is added from Pickup(CC) to Buy(Drill, HWS). Also, because the action Go(Home, HWS) negates At(Home), an ordering constraint is added so that Pickup(CC) comes before Go(Home, HWS). In earlier or later refinements, the remaining flaws, namely the unsatisfied preconditions Have(Milk), Have(Bananas), and At(Home), are eliminated by adding the resolvers shown in the partial-order plan given in answer to the previous subquestion.
Thus, eventually, the agent arrives at a suitable partial order plan.
Question
(12%) We have only considered planners that have goals of achievement: Take steps to ensure that a proposition is true at some time or in some situation. In this exercise, we consider goals of maintenance and prevention. Maintenance goals involve propositions that must remain true over a given interval of time. Prevention goals involve propositions that must never become true over a given interval of time. Discuss how maintenance and prevention goals can be handled by least commitment planning (e.g. the POP algorithm).
Solution
Operators cause changes in the truth values of propositions. Prevention and maintenance goals thus boil down to the problem of scheduling actions so that the truth values of certain propositions are held as required. Consider a prevention goal P, which states that proposition A should be false throughout the time interval (ta1,ta2), and a maintenance goal M, which states that proposition B should be true throughout the time interval (tb1,tb2).
Let us consider an agent which implements ‘Plan Space Planning’ (PSP). For such an agent, plan flaws initially consist of all open goals. The agent iteratively refines a partial plan; at each refinement step, it non-deterministically resolves a flaw. Initially, the partial plan consists of only the ‘Start’ and ‘Finish’ steps.
PSP can handle P and M by preprocessing the problem as follows. For the proposition A to be prevented, create another proposition, NA, whose truth value is always the opposite of A, and modify every action that alters the truth value of A so that it also updates NA accordingly. Create new propositions PA(timeInterval) and MB(timeInterval) to record whether A was prevented and B was maintained during the specified time interval.
Add PA(timeIntervalTA) and MB(timeIntervalTB) as preconditions of the ‘Finish’ step; after the ‘Start’ step, both are false. Add the actions ActP(timeInterval) and ActM(timeInterval), which provide PA(timeInterval) and MB(timeInterval) respectively as their only effects, and which require NA and B respectively as preconditions.
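In STRIPS notation, the two auxiliary actions described above might look like this (the interval arguments are carried along symbolically):

- ActP(timeInterval): preconditions NA; add list PA(timeInterval); delete list empty.
- ActM(timeInterval): preconditions B; add list MB(timeInterval); delete list empty.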
Voila! We have now arrived at a specification of the problem such that it can be solved by a ‘Least Commitment Planner’. The PSP algorithm, with its ability to create ordering constraints and causal links, is sophisticated enough to deal with a problem in this form; when PSP solves the problem specified above, it ensures satisfaction of the prevention and maintenance goals.
Question
(14%) Sometimes MDPs are formulated with a reward function R(s,a) that depends on the action taken or a reward function R(s,a,s’) that also depends on the outcome state.
-
Write the Bellman equations for these formulations.
-
Show how an MDP with reward function R(s,a,s’) can be transformed into a different MDP with reward function R(s,a), such that the optimal policies in the new MDP correspond exactly to optimal policies in the original MDP.
-
Now do the same to convert MDPs with R(s,a) into MDPs with R(s).
Solution
1: In an abstract sense, Bellman equations tell us that the utility of a given state can be calculated from the cumulative future rewards obtained by following an optimal policy. I use this to guide my formulation of the Bellman equations for the cases where rewards are of the form R(s,a) and R(s,a,s’).
The Bellman equations for reward functions of the form R(s), R(s,a), and R(s,a,s’) are given below.
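Using T(s,a,s’) = P(s’|s,a) for the transition model and g for the discount factor, as in part 2 below, the standard forms are, in order:
$$\begin{aligned}
U(s) &= R(s) + g\,\max_{a}\sum_{s'}T(s,a,s')\,U(s')\\
U(s) &= \max_{a}\Big[R(s,a) + g\sum_{s'}T(s,a,s')\,U(s')\Big]\\
U(s) &= \max_{a}\sum_{s'}T(s,a,s')\,\big[R(s,a,s') + g\,U(s')\big]
\end{aligned}$$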
2: Starting from the Bellman equation for a reward function of the form R(s,a,s’):
$$\begin{aligned}
U(s) &= \max_{a}\sum_{s'}T(s,a,s')\,\big[R(s,a,s') + g\,U(s')\big]\\
&= \max_{a}\Big[\sum_{s'}T(s,a,s')\,R(s,a,s') + g\sum_{s'}T(s,a,s')\,U(s')\Big].
\end{aligned}$$
But T(s,a,s’) = P(s’|s,a), so by the definition of conditional expectation
$$E[R(s,a,s')\mid s,a]=\sum_{s'}P(s'\mid s,a)\,R(s,a,s'),$$
which is a function of s and a alone; call it R’(s,a). Hence
$$U(s)=\max_{a}\Big[R'(s,a) + g\sum_{s'}T(s,a,s')\,U(s')\Big].$$
Thus we have reduced the Bellman equation whose reward function has the form R(s,a,s’) to one whose reward function has the form R(s,a). These transformations do not alter the utility of any state, so a policy that solves the latter equation also solves the former.
3:
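One standard construction, sketched here as a possible answer (the post-action states and the adjusted discount factor are part of this sketch): for every state–action pair (s,a) of the original MDP, introduce a new post-action state q(s,a). In the new MDP, taking action a in state s leads deterministically to q(s,a); from q(s,a), a single available action leads to each s’ with probability T(s,a,s’). Give each new state the state-only reward R(q(s,a)) = R(s,a), give every original state reward 0, and use the discount factor √g, so that two steps of the new MDP discount exactly like one step of the original. For any fixed policy, the value of an original state in the new MDP is then
$$U_{\text{new}}(s)=E\Big[\sum_{t\ge 0}(\sqrt{g})^{\,2t+1}R(s_t,a_t)\Big]=\sqrt{g}\;E\Big[\sum_{t\ge 0}g^{t}R(s_t,a_t)\Big]=\sqrt{g}\;U_{\text{old}}(s),$$
a fixed positive rescaling, so the optimal policies of the two MDPs correspond exactly.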
Question
(24%) The goal of this exercise is to give you an understanding of the possible disadvantages of using discounted rewards and to introduce the average reward criterion. Discounted optimization is motivated by domains where reward can be interpreted as money that can earn interest, or where there is a fixed probability that a run will be terminated at any given time. However, many problems do not have either of these properties. Discounting in such domains tends to sacrifice long-term rewards in favor of short-term rewards. Moreover, the discounted optimal policy may depend on the choice of the discount factor. It is true that for any finite MDP (an MDP with finite state and action spaces) there is some sufficiently large discount factor in [0,1) for which the optimal discounted policy also maximizes the average reward.

Consider the 14-state MDP whose state-transition diagram is given below. All transitions are deterministic. The agent receives a reward of +5 on moving from the Printer to Home and a reward of +20 on moving from the Mailroom to Home; all other rewards are zero.

-
How many distinct deterministic policies are there for this MDP? What are they?
-
For each policy, give an expression for the value of state 1 (assuming discounting)?
-
For what values of γ in [0,1) does an optimal policy take the agent to the Printer?
-
For what values of γ in [0,1) does an optimal policy take the agent to the Mailroom?
-
A policy π is called “Blackwell optimal” for a discounted MDP if there is a γ* in [0,1) such that π is optimal for all γ in [γ*, 1). Does this problem have any Blackwell optimal policies? Explain your answer.
-
For each policy, calculate the average reward of state 1. Which policy should the agent follow if it seeks to optimize the average reward?
-
For what range of values of the discount factor γ will the agent select a policy that maximizes the average reward?
Solution
1: There are 2 distinct deterministic policies for the given Markov Decision Process: since all other transitions are forced, the only choice is at state 1, where the agent can either head toward the Printer (call this policy P) or head toward the Mailroom (call this policy P’).
2: Note that the reward function, which is specified on transitions rather than states, can be restated without any loss of generality as a reward function on the states visited: we define R(Printer) = +5 and R(Mailroom) = +20 (received when the agent occupies that state), and zero elsewhere. Because the transition out of each of these states deterministically leads back to Home, this is equivalent to the original transition-based rewards up to a one-step shift in timing, which affects both policies equally.
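Assuming the layout implied by part 6 below (the Printer loop returns to state 1 every 5 steps, with the Printer reached 4 steps after leaving state 1; the Mailroom loop returns every 10 steps, with the Mailroom reached after 9 steps), the value of state 1 under each policy, with discount factor γ, is a geometric series:
$$V_{P}(1) = 5\gamma^{4} + 5\gamma^{9} + 5\gamma^{14} + \cdots = \frac{5\gamma^{4}}{1-\gamma^{5}}, \qquad V_{P'}(1) = 20\gamma^{9} + 20\gamma^{19} + \cdots = \frac{20\gamma^{9}}{1-\gamma^{10}}.$$
(The exact exponents depend on the convention for when a transition reward is credited; a uniform one-step shift scales both values equally and does not change the comparison below.)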
3: We want to find the values of γ in [0,1) for which the policy P that takes the agent to the Printer is optimal, i.e., for which V_P(1) ≥ V_P’(1); the comparison is sketched below.
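A sketch of the comparison, using the expressions above:
$$\frac{5\gamma^{4}}{1-\gamma^{5}} \;\ge\; \frac{20\gamma^{9}}{1-\gamma^{10}} \;\Longleftrightarrow\; 5\gamma^{4}\,(1+\gamma^{5}) \;\ge\; 20\gamma^{9} \;\Longleftrightarrow\; 1+\gamma^{5} \;\ge\; 4\gamma^{5} \;\Longleftrightarrow\; \gamma^{5} \;\le\; \tfrac{1}{3}.$$
So an optimal policy takes the agent to the Printer exactly when γ ≤ (1/3)^{1/5} ≈ 0.80.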
4: By similar reasoning, when g (or γ, the discount factor) exceeds (1/3)^{1/5} ≈ 0.80, so that γ^5 > 1/3, the optimal policy takes the agent to the Mailroom.
5: As explained above, the Mailroom policy P’ is optimal for every γ in [(1/3)^{1/5}, 1). Taking γ* = (1/3)^{1/5}, P’ is optimal for all γ in [γ*, 1), so this problem does have a Blackwell optimal policy, namely P’.
6: Average reward is defined by
$$\rho^{\pi}(s) = \lim_{N\to\infty}\frac{1}{N}\,E\Big[\sum_{t=0}^{N-1} r_{t}\;\Big|\;s_{0}=s\Big].$$
For policy P, at time 0 the agent is in state 1, and at time x it is in state (x%5)+1. Every 5 time steps, the agent accrues a reward of 5 units. Hence, using this information in the defining equation, the average reward of state 1 under P is 5/5 = 1 per time step.
For policy P’, at time 0 the agent is in state 1, and at time x it is in the ((x%10)+1)-th state of its cycle through the Mailroom branch. Every 10 time steps, the agent accrues a reward of 20 units. Hence the average reward of state 1 under P’ is 20/10 = 2 per time step.
So an agent wanting to optimize the average reward should choose policy P’, under which it goes from state 1 to 2’ instead of 2.
7: Putting together the answers to subquestions 4 and 6, it follows that when g (or γ) lies in [(1/3)^{1/5}, 1), i.e., when the discount factor is at least about 0.80, the agent selects policy P’, which is also the policy that maximizes the average reward.