Logo

Chapter 4: Dynamic Programming

Beginning in this chapter, and lasting until the end of chapter 8, we will assume that the environment is a finite MDP, i.e. the cardinalities of the state, action, and reward sets are finite. (Otherwise, tabular methods, which is the focus of this section of the book, are inapplicable). Note that problems with continuous spaces may be quantized for tabular techniques, and would subsequently produce an approximate solution.

The main focus of this chapter is utilizing DP methods to learn value functions, which then guide the selection of good policies. After all, as discussed in chapter 3, optimal policies are easily computed once the corresponding optimal value functions are determined. As a reminder, the Bellman optimality equations are

v(s)=maxaE[Rt+1+γv(St+1)St=s,At=a]=maxas,rp(s,rs,a)[r+γv(s)]q(s,a)=E[Rt+1+γmaxaq(St+1,az)St=s,At=a]=s,rp(s,rs,a)[r+γmaxaq(s,a)]\begin{align*} v_{*}(s) &= \max _{a}\mathbb{E}[R_{t+1}+\gamma v_{*}(S_{t+1})\mid S_{t}=s,A_{t=a}] \\ &= \max_{a}\sum_{s',r}p(s',r \mid s,a)[r+\gamma v_{*}(s')] \\ q_{*}(s,a) &= \mathbb{E}[R_{t+1}+\gamma \max _{a'}q_{*}(S_{t+1},a'z)\mid S_{t}=s,A_{t}=a] \\ &= \sum_{s',r}p(s',r \mid s,a)[r+\gamma \max _{a'}q_{*}(s',a')] \end{align*}

and DP algorithms typically turn these equations into update rules that guide improving approximations for the value functions.

4.1 Policy Evaluation (Prediction)

We'll first consider how to compute the state-value function vπv_{\pi} for some policy π\pi.

Recall the Bellman equation.

vπ(s)=aπ(as)s,rp(s,rs,a)[r+γvπ(s)]v_{\pi}(s)=\sum_{a}\pi(a\mid s)\sum_{s',r}p(s',r \mid s,a)[r+\gamma v_{\pi}(s')]

if the environment's dynamics are known, i.e. we are aware of the state transition probabilities p(s,rs,a)p(s',r \mid s,a), then the above equation is actually a system of S\lvert \mathcal{S} \rvert linear equations, i.e. one for each sSs \in \mathcal{S}. For our purposes, we will consider how to solve such a system iteratively.

We choose an initial approximation v0v_{0} at random (provided any terminal states are assigned a value of 00), and derive the update rule with the Bellman equation

vk+1(s)=Eπ[Rt+1+γvk(St+1)St=s]=aπ(as)s,rp(s,rs,a)[r+γvk(s)]\begin{align*} v_{k+1}(s) &= \mathbb{E}_{\pi}[R_{t+1}+\gamma v_{k}(S_{t+1})\mid S_{t}=s] \\ &= \sum_{a}\pi(a\mid s)\sum_{s',r}p(s',r \mid s,a)[r+\gamma v_{k}(s')] \end{align*}

where we perform updates sS\forall s \in \mathcal{S}. We note that vk=vπv_{k}=v_{\pi} is a fixed point for this update rule, as the Bellman equation for vπv_{\pi} ensures equality in this case. Thus, the sequence {vk}\{ v_{k} \} converges to vkv_{k} as kk\to \infty. This policy is known simply as iterative policy evaluation.

Note that the update rule is called an expected update, since it updates ss using an expectation over immediate rewards and next states. In fact, all DP algorithms use expected updates!

In-Place Implementation

The standard implementation would use two arrays to represent vkv_{k} and vk+1v_{k+1}, and update vk+1v_{k+1} with values from vkv_{k}. However, by using only one array to represent vv, and then updating values in-place, it's possible to actually achieve faster convergence, as, in some sense, the in-place algorithm uses some already known values of vk+1v_{k+1} to update some unknown values of vk+1v_{k+1}.

4.2 Policy Improvement

Once we have computed the value function for a policy, it's natural to seek to then improve the policy based on our new knowledge of the current policy's value function.

The critical idea for policy improvement is that if, for two deterministic policies π\pi and π\pi',

qπ(s,π(s))vπ(s), sSq_{\pi}(s,\pi'(s))\geq v_{\pi}(s),\ \forall s \in \mathcal{S}

then the policy ππ\pi'\geq \pi (at least as good as), or

vπ(s)vπ(s), sSv_{\pi'}(s)\geq v_{\pi}(s),\ \forall s \in \mathcal{S}

This is known as the policy improvement theorem.

Isn't this result trivial? Isn't qπ(s,π(s))=vπ(s)q_{\pi}(s,\pi'(s))=v_{\pi'}(s) if the policies are deterministic?

Not quite! Note that the subscript of qq is π\pi, not π\pi'. This distinction is what makes this result nontrivial, and in fact useful for developing our policy improvement algorithm.

Here is a short proof of the theorem.

vπ(s)qπ(s,π(s))=E[Rt+1+γvπ(St+1)St=s,At=π(s)]=Eπ[Rt+1+γvπ(St+1)St=s]Eπ[Rt+1+γqπ(St+1,π(St+1))St=s]=Eπ[Rt+1+γRt+2+γ2vπ(St+2)St=s]Eπ[Rt+1+γRt+2+γ2Rt+3+γ3vπ(St+3)St=s]Eπ[i=1γi1Rt+iSt=s]=vπ(s)\begin{align*} v_{\pi}(s) &\leq q_{\pi}(s,\pi'(s)) \\ &= \mathbb{E}[R_{t+1}+\gamma v_{\pi}(S_{t+1})\mid S_{t}=s,A_{t}=\pi'(s)] \\ &= \mathbb{E_{\pi'}}[R_{t+1}+\gamma v_{\pi}(S_{t+1})\mid S_{t}=s] \\ &\leq \mathbb{E}_{\pi'}[R_{t+1}+\gamma q_{\pi}(S_{t+1},\pi'(S_{t+1}))\mid S_{t}=s] \\ &= \mathbb{E}_{\pi'}[R_{t+1}+\gamma R_{t+2}+\gamma^{2}v_{\pi}(S_{t+2})\mid S_{t}=s] \\ &\leq \mathbb{E}_{\pi'}[R_{t+1}+\gamma R_{t+2}+\gamma^{2}R_{t+3}+\gamma^{3}v_{\pi}(S_{t+3})\mid S_{t}=s] \\ &\leq \mathbb{E}_{\pi'}\left[ \sum_{i=1}^{\infty} \gamma^{i-1}R_{t+i}\mid S_{t}=s \right] \\ &= v_{\pi'}(s) \end{align*}

where every "\leq" represents an application of the assumption qπ(s,π(s))vπ(s)q_{\pi}(s,\pi'(s))\geq v_{\pi}(s) in the policy improvement theorem.

\square

Thus, if we can find a policy π\pi' that evaluates to a higher value at every state using the value function for policy π\pi, then we know policy π\pi' is better or equal to policy π\pi. In particular, this naturally leads to the consideration of a greedy policy π\pi' such that

π(s)=argmaxa qπ(s,a)\pi'(s) = \underset{a}{\arg\max}\ q_{\pi}(s,a)

which is guaranteed to be better or equal to policy π\pi, according to the theorem. This process of choosing π\pi' greedily with respect to vπv_{\pi} is simply called policy improvement.

Actually, though, we can guarantee something stronger: when following policy improvement, π>π\pi'>\pi unless π=π=π\pi'=\pi=\pi^{*}.

Suppose π=π\pi'=\pi. Then vπ=vπv_{\pi}=v_{\pi'}, and

vπ(s)=maxaE[Rt+1+γvπ(St+1)St=s,At=A]=maxas,rp(s,rs,a)[r+γvπ(s)]\begin{align*} v_{\pi'}(s) &= \max _{a}\mathbb{E}[R_{t+1}+\gamma v_{\pi'}(S_{t+1})\mid S_{t}=s,A_{t}=A] \\ &= \max _{a}\sum_{s',r}p(s',r \mid s,a)[r+\gamma v_{\pi'}(s')] \end{align*}

which is precisely equivalent to the Bellman optimality equation! Thus, vπ=vv_{\pi'}=v_{*}, and π=π=π\pi'=\pi=\pi^{*}.

\square

Stochastic Policies?

The ideas of this section extend to stochastic (non-deterministic) policies too.

4.3 Policy Iteration

With a foundational policy evaluation algorithm and policy improvement algorithm, we can construct the basic policy improvement algorithm, which can be summarized with the following sequence of monotonically improving policies and value functions

π0Evπ0Iπ1Evπ1IIπEv\pi_{0}\overset{ E }{ \to }v_{\pi_{0}}\overset{ I }{ \to }\pi_{1}\overset{ E }{ \to }v_{\pi_{1}}\overset{ I }{ \to }\dots \overset{ I }{ \to }\pi_{*}\overset{ E }{ \to }v_{*}

4.4 Value Iteration

One major issue with policy iteration is that converging to the exact value function for the policy in the policy evaluation step may take a while. Fortunately, we can truncate the policy evaluation step while still preserving the convergence guarantees of policy iteration.

An important special case of this idea is to truncate policy evaluation after just one iteration (one update of each state). This is known as value iteration, and is still known to converge to vv_{*} as the number of iterations approaches infinity. In practice, it's stopped once the value function changes by only a small amount.

4.5 Asynchronous Dynamic Programming

Asynchronous DP algorithms are different from synchronous DP algorithms, which includes all previous algorithms, in that they are not organized with regards to systematic, full sweeps of the set of states. For instance, some states' values may be updated several times before other states are updated. However, to keep convergence guarantees, it must still continue to update all states, i.e. it cannot ignore any state.

Asynchronous DP algorithms can allow interleaving policy evaluation and policy improvement updates. Moreover, they can allow interleaving computation with environment interaction, which enables the DP algorithm to "focus" updates on the most relevant portions of the state set.

4.6 Generalized Policy Iteration

Generalized policy iteration (GPI) is a generalization of the previous policy iteration algorithm that includes both synchronous and asynchronous algorithm, and simply describes the general idea of the interleaving/interaction of the policy-evaluation and policy-improvement processes at any granularity.

In some sense, the evaluation and improvement processes of GPI can be viewed as both competing and cooperating. Making the policy greedy makes the value function outdated, and changing the value function makes the policy non-greedy; however, this interaction slowly brings both closer to the optimal policy and optimal value function.

4.7 Efficiency of Dynamic Programming

If S=n\lvert \mathcal{S} \rvert=n and A=k\lvert \mathcal{A} \rvert=k, a DP method converges to an optimal policy in polynomial time, even despite the number of policies totaling knk^{n}.