Chapter 6: Temporal-Difference Learning

Temporal-difference (TD) methods are a combination of the ideas from Monte Carlo and dynamic programming methods.

6.1 TD Prediction

Recall a simple every-visit Monte Carlo method

V(S_{t})\leftarrow V(S_{t})+\alpha[G_{t}-V(S_{t})]

where $\alpha$ is constant, i.e. the constant- $\alpha$ MC method.

Whereas Monte Carlo methods must wait until the episode's termination to change $V(S_{t})$ (as they must calculate $G_{t}$ from sampling), TD methods need wait only one time step by bootstrapping.

V(S_{t})\leftarrow V(S_{t})+\alpha[R_{t+1}+\gamma V(S_{t+1})-V(S_{t})]

In other words, we replace the sampled return $G_{t}$ with the bootstrapped estimate $G_{t}\approx R_{t+1}+\gamma V(S_{t+1})$ , using our current estimate of $V$ to form the target.

This method is known as $\text{TD}(0)$ or one-step TD; we explore $\text{TD}(\lambda)$ with nonzero $\lambda$ in more detail next chapter.
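The one-step update above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the five-state random walk, the function names `random_walk_episode` and `td0_prediction`, the initial values of 0.5, and the step size are all assumptions chosen for the example.

```python
import random

def random_walk_episode(rng, n_states=5):
    """Yield (state, reward, next_state, done) for a symmetric random walk.

    Hypothetical environment: states 1..n_states are nonterminal, 0 and
    n_states + 1 are terminal, and only the right terminal pays reward 1.
    """
    state = (n_states + 1) // 2
    while True:
        next_state = state + rng.choice((-1, 1))
        done = next_state in (0, n_states + 1)
        reward = 1.0 if next_state == n_states + 1 else 0.0
        yield state, reward, next_state, done
        if done:
            return
        state = next_state

def td0_prediction(num_episodes, alpha=0.05, gamma=1.0, seed=0):
    """Tabular TD(0): update V after every transition, not at episode end."""
    rng = random.Random(seed)
    V = {s: 0.5 for s in range(1, 6)}  # arbitrary initial estimates
    for _ in range(num_episodes):
        for s, r, s_next, done in random_walk_episode(rng):
            v_next = 0.0 if done else V[s_next]  # terminal states have value 0
            V[s] += alpha * (r + gamma * v_next - V[s])
    return V

V = td0_prediction(num_episodes=2000)
```

For this particular walk the true values are $s/6$ for state $s$, so the learned `V` should increase from left to right and hover near those values, with some residual noise because $\alpha$ is constant.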

Note that $\text{TD}(0)$ is similar to DP methods in that both use the current estimate of $V$ to form the target; however, $\text{TD}(0)$ samples the expectation over successor states from a single observed transition (as Monte Carlo does), whereas dynamic programming computes that expectation exactly from a model. Thus, TD methods merge Monte Carlo sampling with DP bootstrapping.

Also, note that the quantity within the brackets,

\delta_{t}=R_{t+1}+\gamma V(S_{t+1})-V(S_{t})

can be interpreted as an "error" between our current estimate of $V(S_{t})$ and the improved estimate $R_{t+1}+\gamma V(S_{t+1})$ that follows from this sample. This quantity is known as the TD error, and shows up often in RL. For instance, if $V$ is held fixed during the episode, the Monte Carlo error may be written as

\begin{align*} G_{t}-V(S_{t}) &= R_{t+1}+\gamma G_{t+1}-V(S_{t}) \\ &= R_{t+1}+\gamma G_{t+1}-V(S_{t})+\gamma(V(S_{t+1})-V(S_{t+1})) \\ &= (R_{t+1}+\gamma V(S_{t+1})-V(S_{t}))+\gamma(G_{t+1}-V(S_{t+1})) \\ &= \delta_{t}+\gamma \delta_{t+1}+\gamma^{2}(G_{t+2}-V(S_{t+2})) \\ &= \delta_{t}+\gamma \delta_{t+1}+\dots+ \cancel{ \gamma^{T-t}(G_{T}-V(S_{T})) } \\ &= \sum_{k=t}^{T-1} \gamma^{k-t}\delta_{k} \end{align*}

or a sum of TD errors.
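The identity can be checked numerically. The sketch below uses an arbitrary made-up trajectory and arbitrary fixed value estimates (all numbers and state names are assumptions for illustration) and confirms that $G_{t}-V(S_{t})$ equals the discounted sum of TD errors at every time step.

```python
gamma = 0.9

# Arbitrary fixed estimates; the terminal state must have value 0.
V = {"a": 0.3, "b": -1.2, "c": 2.0, "end": 0.0}
states = ["a", "b", "c", "end"]   # S_0, S_1, S_2, S_T
rewards = [1.0, -2.0, 0.5]        # R_1, R_2, R_3

T = len(rewards)

# Returns G_t computed backwards, with G_T = 0.
G = [0.0] * (T + 1)
for t in reversed(range(T)):
    G[t] = rewards[t] + gamma * G[t + 1]

# TD errors delta_t = R_{t+1} + gamma V(S_{t+1}) - V(S_t), with V held fixed.
deltas = [rewards[t] + gamma * V[states[t + 1]] - V[states[t]] for t in range(T)]

max_gap = 0.0
for t in range(T):
    mc_error = G[t] - V[states[t]]
    td_sum = sum(gamma ** (k - t) * deltas[k] for k in range(t, T))
    max_gap = max(max_gap, abs(mc_error - td_sum))
```

Here `max_gap` is zero up to floating-point rounding, because `V` never changes while the sums are evaluated.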

Note that this identity is exact only if $V$ does not change during the episode. In methods like $\text{TD}(0)$ , where $V$ is updated during the episode, it holds only approximately, although the approximation is close when the step size $\alpha$ is small.

6.2 Advantages of TD Prediction Methods

Compared to DP methods, TD methods have a clear advantage in that they do not require a model of the environment.

Compared to Monte Carlo methods, TD methods are better in the sense that they may be updated in an online, incremental fashion (i.e. during the episode). Moreover, some Monte Carlo methods must ignore or discount episodes in which exploratory actions are taken, which slows learning; TD methods are much less susceptible to this issue because they learn from each transition regardless of subsequent actions.

Moreover, TD methods still converge to the correct answer for any fixed policy $\pi$ , given standard conditions on the step size. And, in practice, TD methods have usually been found to converge faster than constant- $\alpha$ MC methods on stochastic tasks.

6.3 Optimality of $\text{TD}(0)$

A common method of learning is to present a static dataset of experience until our model or estimator converges.

Evaluate estimator $V$ on entire dataset.
Improve $V$ with desired changes.
Repeat.

This is known as batch updating, and $\text{TD}(0)$ is known to converge deterministically to a single answer under this procedure, independent of $\alpha$ as long as $\alpha$ is sufficiently small. Notably, a constant- $\alpha$ MC method also converges deterministically under the same conditions, but to a different answer.

In particular, batch MC finds the estimates that minimize mean squared error on the training set, while batch $\text{TD}(0)$ finds the estimates that would be exactly correct for the maximum-likelihood model of the Markov process. The value function computed from that maximum-likelihood model is known as the certainty-equivalence estimate; this is what batch $\text{TD}(0)$ converges to!
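To make the difference concrete, here is a small sketch that runs both batch procedures on a toy dataset. The eight episodes, the state names `A` and `B`, and the helper names `batch_mc` and `batch_td0` are assumptions made up for illustration; each transition is stored as `(state, reward, next_state)` with `None` marking termination.

```python
# One episode A -> B -> end with zero reward, six episodes B -> end with
# reward 1, and one episode B -> end with reward 0 (hypothetical data).
episodes = (
    [[("A", 0.0, "B"), ("B", 0.0, None)]]
    + [[("B", 1.0, None)]] * 6
    + [[("B", 0.0, None)]]
)

def batch_mc(episodes, gamma=1.0):
    """Average the observed returns per state (minimizes training-set MSE)."""
    returns = {}
    for episode in episodes:
        G = 0.0
        for state, reward, _ in reversed(episode):
            G = reward + gamma * G
            returns.setdefault(state, []).append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

def batch_td0(episodes, alpha=0.05, gamma=1.0, tol=1e-10):
    """Sweep the whole dataset, apply the summed TD(0) increments, repeat."""
    V = {s: 0.0 for episode in episodes for s, _, _ in episode}
    while True:
        increments = dict.fromkeys(V, 0.0)
        for episode in episodes:
            for s, r, s_next in episode:
                v_next = 0.0 if s_next is None else V[s_next]
                increments[s] += alpha * (r + gamma * v_next - V[s])
        for s in V:
            V[s] += increments[s]
        if max(abs(d) for d in increments.values()) < tol:
            return V

V_mc = batch_mc(episodes)
V_td = batch_td0(episodes)
```

Both methods agree that the value of `B` is 0.75, but they disagree on `A`: batch MC reports 0 (the only return ever observed from `A`), while batch TD(0) reports approximately 0.75, since in the maximum-likelihood model `A` always leads to `B` with zero reward.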

6.4 Sarsa: On-policy TD Control

We now apply TD prediction methods to the control problem. (We will still use TD methods for the prediction portion of GPI).

First, we learn an action-value function rather than a state-value function.

Q(S_{t},A_{t})\leftarrow Q(S_{t},A_{t})+\alpha[R_{t+1}+\gamma Q(S_{t+1},A_{t+1})-Q(S_{t},A_{t})]

where the update is applied after every transition from a nonterminal state $S_{t}$ , and $Q(S_{t+1},A_{t+1})=0$ if $S_{t+1}$ is terminal.

tip

Note that the update rule uses the elements $\{ S_{t},A_{t},R_{t+1},S_{t+1},A_{t+1} \}$ , which inspired its nomenclature of Sarsa!

Now, as with other on-policy methods, we continually estimate $q_{\pi}$ for our current policy $\pi$ and simultaneously improve $\pi$ toward a greedy solution with respect to $q_{\pi}$ . Notably, though, we must ensure that $\pi$ is $\varepsilon$ -soft to maintain convergence guarantees; often, an $\varepsilon$ schedule like $\varepsilon=\frac{1}{t}$ is used so that the policy becomes greedy in the limit, which produces eventual convergence to the optimal policy.
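A minimal tabular Sarsa loop might look like the sketch below. The five-state corridor environment, the `step` and `epsilon_greedy` helpers, the fixed $\varepsilon$ , and all hyperparameters are assumptions chosen to keep the example short; a decaying $\varepsilon$ schedule is omitted for simplicity.

```python
import random

# Hypothetical corridor: states 0..4, start at 0, state 4 is terminal.
N_STATES, GOAL, ACTIONS = 5, 4, (0, 1)   # action 0 = left, 1 = right

def step(state, action):
    """Return (reward, next_state, done); every move costs -1."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    return -1.0, next_state, next_state == GOAL

def epsilon_greedy(Q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[state][a])

def sarsa(num_episodes=2000, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(num_episodes):
        s = 0
        a = epsilon_greedy(Q, s, epsilon, rng)
        done = False
        while not done:
            r, s_next, done = step(s, a)
            # On-policy: the next action comes from the same policy we follow.
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)
            # Q of a terminal state is treated as 0.
            target = r + (0.0 if done else gamma * Q[s_next][a_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s_next, a_next
    return Q

Q = sarsa()
```

After training, moving right should have the higher action value in every nonterminal state, since it is the only direction that reaches the goal.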

6.5 $Q$ -Learning: Off-policy TD Control

$Q$ -Learning was a major milestone in reinforcement learning, and was the first off-policy TD control algorithm. It's defined by the following update rule:

Q(S_{t},A_{t})\leftarrow Q(S_{t},A_{t})+\alpha[R_{t+1}+\gamma \max _{a}Q(S_{t+1},a)-Q(S_{t},A_{t})]

The key idea of this off-policy algorithm is that we don't need to use $Q(S_{t+1},A_{t+1})$ in our update rule. Instead, we replace the behavior policy's action $A_{t+1}$ with the target policy's action $\underset{a}{\arg\max}\ Q(S_{t+1},a)$ , where we implicitly assume the target policy is a simple greedy policy.
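As a sketch, a single $Q$ -learning update can be written as a small function. The dictionary layout of `Q`, the state and action names, and the default hyperparameters are assumptions for illustration.

```python
def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update in place and return the TD error.

    Q is assumed to be a dict of dicts: Q[state][action] -> value.
    """
    # The target uses the greedy action in s_next, whatever the behavior
    # policy actually does next.
    best_next = 0.0 if done else max(Q[s_next].values())
    td_error = r + gamma * best_next - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error

Q = {
    "s0": {"left": 0.0, "right": 1.0},
    "s1": {"left": 2.0, "right": 5.0},
}
delta = q_learning_update(Q, "s0", "right", r=1.0, s_next="s1", done=False)
```

With these numbers the target is $1+0.9\cdot 5=5.5$ , so the TD error is 4.5 and `Q["s0"]["right"]` moves from 1.0 to 1.45. Note that the function never receives $A_{t+1}$ , which is what makes the update off-policy.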

We can't easily do $V$ -learning. Can you figure out why?

Our estimate of $V(S_{t})$ is ultimately controlled by the trajectory/sample we're learning from, which is not produced by the target policy we're trying to learn. In contrast, the update for $Q(S_{t},A_{t})$ depends on the target policy's choice of next action and isn't really influenced by the behavior policy; the behavior policy's trajectories only serve to eliminate the need to know state transition probabilities (i.e. interaction with the environment essentially emulates sampling each $S_{t+1},R_{t+1}$ from $S_{t},A_{t}$ ).

6.6 Expected Sarsa

With $Q$ -learning, we previously used an implicit target policy that was greedy. What if we'd like to learn the $Q$ -function for any target policy, though? Expected Sarsa is useful for this purpose, and has update rule

\begin{align*} Q(S_{t},A_{t})&\leftarrow Q(S_{t},A_{t})+\alpha[R_{t+1}+\gamma \mathbb{E}_{\pi}[Q(S_{t+1},A_{t+1})\mid S_{t+1}]-Q(S_{t},A_{t})] \\ &\leftarrow Q(S_{t},A_{t})+\alpha\left[ R_{t+1}+\gamma \sum_{a}\pi(a\mid S_{t+1})Q(S_{t+1},a)-Q(S_{t},A_{t}) \right] \end{align*}

This is essentially Sarsa, but rather than use the trajectory's sampled $A_{t+1}$ to estimate $Q(S_{t+1},A_{t+1})$ , we compute the estimate over the entire action space, weighting each action by its probability under our target policy $\pi$ . This expectation-based update reduces variance compared to the stochastic/sample-based update of Sarsa.
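The expectation in the target is just a probability-weighted sum, as the sketch below shows. The $\varepsilon$ -greedy target policy, the helper names, and the example action values are assumptions for illustration.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy over q_values."""
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs

def expected_sarsa_target(r, q_next, probs, gamma=0.9, done=False):
    """R + gamma * sum_a pi(a | S') Q(S', a), or just R at termination."""
    if done:
        return r
    return r + gamma * sum(p * q for p, q in zip(probs, q_next))

q_next = [1.0, 3.0, 2.0]   # hypothetical Q(S_{t+1}, a) for three actions

target_soft = expected_sarsa_target(
    0.0, q_next, epsilon_greedy_probs(q_next, epsilon=0.3), gamma=1.0
)
target_greedy = expected_sarsa_target(
    0.0, q_next, epsilon_greedy_probs(q_next, epsilon=0.0), gamma=1.0
)
```

With $\varepsilon=0.3$ the probabilities are roughly $(0.1, 0.8, 0.1)$ and the target is about 2.7. With $\varepsilon=0$ all the weight sits on the greedy action, so the target equals $\max_{a}Q(S_{t+1},a)=3$ , which is exactly the $Q$ -learning target.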

Expected Sarsa also performs better than $Q$ -learning during online training. $Q$ -learning uses an implicit greedy target policy, which may result in updates to the $\epsilon$ -soft behavioral policy that perform poorly on trajectories (e.g. cliff walking example, where the target policy OKs walking right next to the cliff, causing the behavioral policy to fall off the cliff at times). In contrast, Expected Sarsa may use a target policy that is on-policy or near on-policy (more accurately reflects the behavior policy) to provide better updates to the $\epsilon$ -soft behavioral policy (e.g. cliff walking example, where the target policy encourages moving away from the cliff).

tip

Expected Sarsa is often used in both off-policy and on-policy formats, and is also a generalization of $Q$ -learning: choosing $\pi$ to be the greedy policy produces $Q$ -learning.

6.7 Maximization Bias and Double Learning

Sarsa often learns an $\varepsilon$ -greedy policy, and $Q$ -learning learns a greedy policy. Such policies all involve a $\text{max}$ operator. Unfortunately, this maximum over estimated values is implicitly used as an estimate of the maximum value, which can induce significant positive bias. This is because

\mathbb{E}[\max(X_{1},\dots ,X_{n})]\geq \max (\mathbb{E}[X_{1}],\dots ,\mathbb{E}[X_{n}])

Think about why this should make sense, intuitively, due to the inherent noise of random variables! This bias is known as maximization bias.
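A quick simulation illustrates the inequality. In this sketch (the number of actions, the unit-variance Gaussian noise, and the seed are all assumptions), every action has true value 0, so the true maximum is 0, yet the maximum over noisy estimates is positive on average.

```python
import random

rng = random.Random(0)
n_actions, n_trials = 10, 2000

total = 0.0
for _ in range(n_trials):
    # Each estimate is the true value (0) plus zero-mean noise.
    estimates = [rng.gauss(0.0, 1.0) for _ in range(n_actions)]
    total += max(estimates)

avg_max_estimate = total / n_trials   # sample estimate of E[max_i X_i]
true_max = 0.0                        # max_i E[X_i]
```

The average of the maxima comes out well above `true_max`, even though no individual estimate is biased: taking the max systematically picks out whichever estimate happened to receive the largest positive noise.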

Double $Q$ -learning is a solution that eliminates this bias by learning two distinct $Q$ -functions. Its update rules are

\begin{align*} Q_{1}(S_{t},A_{t})&\leftarrow Q_{1}(S_{t},A_{t})+\alpha[R_{t+1}+\gamma Q_{2}(S_{t+1},\underset{a}{\arg\max}\ Q_{1}(S_{t+1},a))-Q_{1}(S_{t},A_{t})] \\ Q_{2}(S_{t},A_{t})&\leftarrow Q_{2}(S_{t},A_{t})+\alpha[R_{t+1}+\gamma Q_{1}(S_{t+1},\underset{a}{\arg\max}\ Q_{2}(S_{t+1},a))-Q_{2}(S_{t},A_{t})] \\ \end{align*}

where, essentially, $Q_{1}$ 's update rule uses $Q_{1}$ to select the maximizing action but $Q_{2}$ to evaluate it, and vice versa for $Q_{2}$ . This removes the bias involved with evaluating $Q_{1}$ at its own maximizing action, which would positively bias the estimate since it effectively becomes $\text{actual estimate}+\text{largest positive noise}$ .

info

Typically, one of the two $Q$ -functions is randomly chosen for each transition/step to update.

Note that double versions of Sarsa and Expected Sarsa exist as well.
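One way to sketch a single double $Q$ -learning step, including the random choice of which table to update, is shown below. The dict-of-dicts tables, the function name `double_q_update`, and the example values are assumptions for illustration.

```python
import random

def double_q_update(Q1, Q2, s, a, r, s_next, done, rng, alpha=0.1, gamma=0.9):
    """Update one of the two tables in place; return True if Q1 was updated."""
    # Flip a fair coin to decide which table learns on this transition.
    if rng.random() < 0.5:
        learner, evaluator = Q1, Q2
    else:
        learner, evaluator = Q2, Q1
    if done:
        target = r
    else:
        # The learner selects the action, the other table evaluates it.
        best = max(learner[s_next], key=learner[s_next].get)
        target = r + gamma * evaluator[s_next][best]
    learner[s][a] += alpha * (target - learner[s][a])
    return learner is Q1

Q1 = {"s0": {"l": 0.0, "r": 0.0}, "s1": {"l": 1.0, "r": 4.0}}
Q2 = {"s0": {"l": 0.0, "r": 0.0}, "s1": {"l": 3.0, "r": 2.0}}
updated_q1 = double_q_update(
    Q1, Q2, "s0", "r", r=1.0, s_next="s1", done=False, rng=random.Random(0)
)
```

If `Q1` is chosen, it selects action `r` in `s1` (its own maximizer) but the target uses `Q2`'s value for that action, 2.0, rather than `Q1`'s own 4.0. If `Q2` is chosen, the roles are swapped. A behavior policy would typically act on the sum or average of the two tables.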

6.8 Games, Afterstates, and Other Special Cases

So far, our approach has involved learning either an action-value or state-value function. However, there exist problems where such functions are not applicable!

For instance, consider the game of tic-tac-toe. After playing a move, we have only partial knowledge of the environment's dynamics: we know exactly which board position our move produces, but we cannot know the move our opponent will play in reply. In such cases, we use a state-value function that evaluates the environment state after the agent performs its action, in contrast to conventional state-value functions that evaluate the state with the assumption that the agent will then have the option of selecting an action. Such states are known as afterstates, and their value functions are known as afterstate value functions.

Notably, we don't have to use afterstate value functions for these types of problems; however, it's much more efficient. A conventional action-value function would map from a state-action pair to a value estimate. In contrast, an afterstate value function would map only from the result of the state-action pair (the afterstate) to a value estimate. This matters when several state-action pairs produce the same afterstate: they are effectively identical, so anything learned about one immediately transfers to the others.
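The sharing effect can be seen in a short sketch. The board encoding, the helper names `make_board` and `afterstate`, and the numbers used in the update are assumptions for illustration.

```python
def make_board(marks):
    """Build a tic-tac-toe board as a 9-tuple from a {cell_index: mark} dict."""
    cells = [" "] * 9
    for index, mark in marks.items():
        cells[index] = mark
    return tuple(cells)

def afterstate(board, move, mark="X"):
    """Return the board that results once `mark` is placed at `move`."""
    if board[move] != " ":
        raise ValueError("cell already occupied")
    cells = list(board)
    cells[move] = mark
    return tuple(cells)

# Two different positions, each with a different move for X...
board_a = make_board({0: "X", 4: "O"})   # X then plays cell 8
board_b = make_board({8: "X", 4: "O"})   # X then plays cell 0

# ...that lead to the same afterstate.
after_a = afterstate(board_a, 8)
after_b = afterstate(board_b, 0)

# An afterstate value table keeps one entry for both state-action pairs.
afterstate_values = {after_a: 0.5}
afterstate_values[after_a] += 0.1 * (1.0 - afterstate_values[after_a])
shared_value = afterstate_values[after_b]
```

An action-value table would store `(board_a, 8)` and `(board_b, 0)` separately and would have to learn each from its own experience; here the update made through the first pair is immediately visible through the second.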

We will not discuss afterstate methods in depth; however, the methods and principles in this book (e.g. GPI) still apply to afterstate methods!
