Chapter 4: Dynamic Programming
Beginning in this chapter, and lasting until the end of chapter 8, we will assume that the environment is a finite MDP, i.e. the cardinalities of the state, action, and reward sets are finite. (Otherwise, tabular methods, which is the focus of this section of the book, are inapplicable). Note that problems with continuous spaces may be quantized for tabular techniques, and would subsequently produce an approximate solution.
The main focus of this chapter is utilizing DP methods to learn value functions, which then guide the selection of good policies. After all, as discussed in chapter 3, optimal policies are easily computed once the corresponding optimal value functions are determined. As a reminder, the Bellman optimality equations are
and DP algorithms typically turn these equations into update rules that guide improving approximations for the value functions.
4.1 Policy Evaluation (Prediction)
We'll first consider how to compute the state-value function for some policy .
Recall the Bellman equation.
if the environment's dynamics are known, i.e. we are aware of the state transition probabilities , then the above equation is actually a system of linear equations, i.e. one for each . For our purposes, we will consider how to solve such a system iteratively.
We choose an initial approximation at random (provided any terminal states are assigned a value of ), and derive the update rule with the Bellman equation
where we perform updates . We note that is a fixed point for this update rule, as the Bellman equation for ensures equality in this case. Thus, the sequence converges to as . This policy is known simply as iterative policy evaluation.
Note that the update rule is called an expected update, since it updates using an expectation over immediate rewards and next states. In fact, all DP algorithms use expected updates!
The standard implementation would use two arrays to represent and , and update with values from . However, by using only one array to represent , and then updating values in-place, it's possible to actually achieve faster convergence, as, in some sense, the in-place algorithm uses some already known values of to update some unknown values of .
4.2 Policy Improvement
Once we have computed the value function for a policy, it's natural to seek to then improve the policy based on our new knowledge of the current policy's value function.
The critical idea for policy improvement is that if, for two deterministic policies and ,
then the policy (at least as good as), or
This is known as the policy improvement theorem.
Not quite! Note that the subscript of is , not . This distinction is what makes this result nontrivial, and in fact useful for developing our policy improvement algorithm.
Here is a short proof of the theorem.
where every "" represents an application of the assumption in the policy improvement theorem.
Thus, if we can find a policy that evaluates to a higher value at every state using the value function for policy , then we know policy is better or equal to policy . In particular, this naturally leads to the consideration of a greedy policy such that
which is guaranteed to be better or equal to policy , according to the theorem. This process of choosing greedily with respect to is simply called policy improvement.
Actually, though, we can guarantee something stronger: when following policy improvement, unless .
Suppose . Then , and
which is precisely equivalent to the Bellman optimality equation! Thus, , and .
The ideas of this section extend to stochastic (non-deterministic) policies too.
4.3 Policy Iteration
With a foundational policy evaluation algorithm and policy improvement algorithm, we can construct the basic policy improvement algorithm, which can be summarized with the following sequence of monotonically improving policies and value functions
4.4 Value Iteration
One major issue with policy iteration is that converging to the exact value function for the policy in the policy evaluation step may take a while. Fortunately, we can truncate the policy evaluation step while still preserving the convergence guarantees of policy iteration.
An important special case of this idea is to truncate policy evaluation after just one iteration (one update of each state). This is known as value iteration, and is still known to converge to as the number of iterations approaches infinity. In practice, it's stopped once the value function changes by only a small amount.
4.5 Asynchronous Dynamic Programming
Asynchronous DP algorithms are different from synchronous DP algorithms, which includes all previous algorithms, in that they are not organized with regards to systematic, full sweeps of the set of states. For instance, some states' values may be updated several times before other states are updated. However, to keep convergence guarantees, it must still continue to update all states, i.e. it cannot ignore any state.
Asynchronous DP algorithms can allow interleaving policy evaluation and policy improvement updates. Moreover, they can allow interleaving computation with environment interaction, which enables the DP algorithm to "focus" updates on the most relevant portions of the state set.
4.6 Generalized Policy Iteration
Generalized policy iteration (GPI) is a generalization of the previous policy iteration algorithm that includes both synchronous and asynchronous algorithm, and simply describes the general idea of the interleaving/interaction of the policy-evaluation and policy-improvement processes at any granularity.
In some sense, the evaluation and improvement processes of GPI can be viewed as both competing and cooperating. Making the policy greedy makes the value function outdated, and changing the value function makes the policy non-greedy; however, this interaction slowly brings both closer to the optimal policy and optimal value function.
4.7 Efficiency of Dynamic Programming
If and , a DP method converges to an optimal policy in polynomial time, even despite the number of policies totaling .