Chapter 4: Dynamic Programming

Beginning in this chapter, and lasting until the end of chapter 8, we will assume that the environment is a finite MDP, i.e. the cardinalities of the state, action, and reward sets are finite. (Otherwise, tabular methods, which is the focus of this section of the book, are inapplicable). Note that problems with continuous spaces may be quantized for tabular techniques, and would subsequently produce an approximate solution.

The main focus of this chapter is utilizing DP methods to learn value functions, which then guide the selection of good policies. After all, as discussed in chapter 3, optimal policies are easily computed once the corresponding optimal value functions are determined. As a reminder, the Bellman optimality equations are

\begin{align*} v_{*}(s) &= \max _{a}\mathbb{E}[R_{t+1}+\gamma v_{*}(S_{t+1})\mid S_{t}=s,A_{t=a}] \\ &= \max_{a}\sum_{s',r}p(s',r \mid s,a)[r+\gamma v_{*}(s')] \\ q_{*}(s,a) &= \mathbb{E}[R_{t+1}+\gamma \max _{a'}q_{*}(S_{t+1},a'z)\mid S_{t}=s,A_{t}=a] \\ &= \sum_{s',r}p(s',r \mid s,a)[r+\gamma \max _{a'}q_{*}(s',a')] \end{align*}

and DP algorithms typically turn these equations into update rules that guide improving approximations for the value functions.

4.1 Policy Evaluation (Prediction)

We'll first consider how to compute the state-value function $v_{\pi}$ for some policy $\pi$ .

Recall the Bellman equation.

v_{\pi}(s)=\sum_{a}\pi(a\mid s)\sum_{s',r}p(s',r \mid s,a)[r+\gamma v_{\pi}(s')]

if the environment's dynamics are known, i.e. we are aware of the state transition probabilities $p(s',r \mid s,a)$ , then the above equation is actually a system of $\lvert \mathcal{S} \rvert$ linear equations, i.e. one for each $s \in \mathcal{S}$ . For our purposes, we will consider how to solve such a system iteratively.

We choose an initial approximation $v_{0}$ at random (provided any terminal states are assigned a value of $0$ ), and derive the update rule with the Bellman equation

\begin{align*} v_{k+1}(s) &= \mathbb{E}_{\pi}[R_{t+1}+\gamma v_{k}(S_{t+1})\mid S_{t}=s] \\ &= \sum_{a}\pi(a\mid s)\sum_{s',r}p(s',r \mid s,a)[r+\gamma v_{k}(s')] \end{align*}

where we perform updates $\forall s \in \mathcal{S}$ . We note that $v_{k}=v_{\pi}$ is a fixed point for this update rule, as the Bellman equation for $v_{\pi}$ ensures equality in this case. Thus, the sequence $\{ v_{k} \}$ converges to $v_{k}$ as $k\to \infty$ . This policy is known simply as iterative policy evaluation.

Note that the update rule is called an expected update, since it updates $s$ using an expectation over immediate rewards and next states. In fact, all DP algorithms use expected updates!

In-Place Implementation

The standard implementation would use two arrays to represent $v_{k}$ and $v_{k+1}$ , and update $v_{k+1}$ with values from $v_{k}$ . However, by using only one array to represent $v$ , and then updating values in-place, it's possible to actually achieve faster convergence, as, in some sense, the in-place algorithm uses some already known values of $v_{k+1}$ to update some unknown values of $v_{k+1}$ .

4.2 Policy Improvement

Once we have computed the value function for a policy, it's natural to seek to then improve the policy based on our new knowledge of the current policy's value function.

The critical idea for policy improvement is that if, for two deterministic policies $\pi$ and $\pi'$ ,

q_{\pi}(s,\pi'(s))\geq v_{\pi}(s),\ \forall s \in \mathcal{S}

then the policy $\pi'\geq \pi$ (at least as good as), or

v_{\pi'}(s)\geq v_{\pi}(s),\ \forall s \in \mathcal{S}

This is known as the policy improvement theorem.

Isn't this result trivial? Isn't

q_{\pi}(s,\pi'(s))=v_{\pi'}(s)

if the policies are deterministic?

Not quite! Note that the subscript of $q$ is $\pi$ , not $\pi'$ . This distinction is what makes this result nontrivial, and in fact useful for developing our policy improvement algorithm.

Here is a short proof of the theorem.

\begin{align*} v_{\pi}(s) &\leq q_{\pi}(s,\pi'(s)) \\ &= \mathbb{E}[R_{t+1}+\gamma v_{\pi}(S_{t+1})\mid S_{t}=s,A_{t}=\pi'(s)] \\ &= \mathbb{E_{\pi'}}[R_{t+1}+\gamma v_{\pi}(S_{t+1})\mid S_{t}=s] \\ &\leq \mathbb{E}_{\pi'}[R_{t+1}+\gamma q_{\pi}(S_{t+1},\pi'(S_{t+1}))\mid S_{t}=s] \\ &= \mathbb{E}_{\pi'}[R_{t+1}+\gamma R_{t+2}+\gamma^{2}v_{\pi}(S_{t+2})\mid S_{t}=s] \\ &\leq \mathbb{E}_{\pi'}[R_{t+1}+\gamma R_{t+2}+\gamma^{2}R_{t+3}+\gamma^{3}v_{\pi}(S_{t+3})\mid S_{t}=s] \\ &\leq \mathbb{E}_{\pi'}\left[ \sum_{i=1}^{\infty} \gamma^{i-1}R_{t+i}\mid S_{t}=s \right] \\ &= v_{\pi'}(s) \end{align*}

where every " $\leq$ " represents an application of the assumption $q_{\pi}(s,\pi'(s))\geq v_{\pi}(s)$ in the policy improvement theorem.

$\square$

Thus, if we can find a policy $\pi'$ that evaluates to a higher value at every state using the value function for policy $\pi$ , then we know policy $\pi'$ is better or equal to policy $\pi$ . In particular, this naturally leads to the consideration of a greedy policy $\pi'$ such that

\pi'(s) = \underset{a}{\arg\max}\ q_{\pi}(s,a)

which is guaranteed to be better or equal to policy $\pi$ , according to the theorem. This process of choosing $\pi'$ greedily with respect to $v_{\pi}$ is simply called policy improvement.

Actually, though, we can guarantee something stronger: when following policy improvement, $\pi'>\pi$ unless $\pi'=\pi=\pi^{*}$ .

Suppose $\pi'=\pi$ . Then $v_{\pi}=v_{\pi'}$ , and

\begin{align*} v_{\pi'}(s) &= \max _{a}\mathbb{E}[R_{t+1}+\gamma v_{\pi'}(S_{t+1})\mid S_{t}=s,A_{t}=A] \\ &= \max _{a}\sum_{s',r}p(s',r \mid s,a)[r+\gamma v_{\pi'}(s')] \end{align*}

which is precisely equivalent to the Bellman optimality equation! Thus, $v_{\pi'}=v_{*}$ , and $\pi'=\pi=\pi^{*}$ .

$\square$

Stochastic Policies?

The ideas of this section extend to stochastic (non-deterministic) policies too.

4.3 Policy Iteration

With a foundational policy evaluation algorithm and policy improvement algorithm, we can construct the basic policy improvement algorithm, which can be summarized with the following sequence of monotonically improving policies and value functions

\pi_{0}\overset{ E }{ \to }v_{\pi_{0}}\overset{ I }{ \to }\pi_{1}\overset{ E }{ \to }v_{\pi_{1}}\overset{ I }{ \to }\dots \overset{ I }{ \to }\pi_{*}\overset{ E }{ \to }v_{*}

4.4 Value Iteration

One major issue with policy iteration is that converging to the exact value function for the policy in the policy evaluation step may take a while. Fortunately, we can truncate the policy evaluation step while still preserving the convergence guarantees of policy iteration.

An important special case of this idea is to truncate policy evaluation after just one iteration (one update of each state). This is known as value iteration, and is still known to converge to $v_{*}$ as the number of iterations approaches infinity. In practice, it's stopped once the value function changes by only a small amount.

4.5 Asynchronous Dynamic Programming

Asynchronous DP algorithms are different from synchronous DP algorithms, which includes all previous algorithms, in that they are not organized with regards to systematic, full sweeps of the set of states. For instance, some states' values may be updated several times before other states are updated. However, to keep convergence guarantees, it must still continue to update all states, i.e. it cannot ignore any state.

Asynchronous DP algorithms can allow interleaving policy evaluation and policy improvement updates. Moreover, they can allow interleaving computation with environment interaction, which enables the DP algorithm to "focus" updates on the most relevant portions of the state set.

4.6 Generalized Policy Iteration

Generalized policy iteration (GPI) is a generalization of the previous policy iteration algorithm that includes both synchronous and asynchronous algorithm, and simply describes the general idea of the interleaving/interaction of the policy-evaluation and policy-improvement processes at any granularity.

In some sense, the evaluation and improvement processes of GPI can be viewed as both competing and cooperating. Making the policy greedy makes the value function outdated, and changing the value function makes the policy non-greedy; however, this interaction slowly brings both closer to the optimal policy and optimal value function.

4.7 Efficiency of Dynamic Programming

If $\lvert \mathcal{S} \rvert=n$ and $\lvert \mathcal{A} \rvert=k$ , a DP method converges to an optimal policy in polynomial time, even despite the number of policies totaling $k^{n}$ .