Chapter 5: Monte Carlo Methods

Unlike Chapter 4, we now assume that we do not have complete knowledge of the environment! In particular, we lack knowledge of state transition probabilities.

This forces us to consider Monte Carlo methods—algorithms that learn from sample experiences of interaction with an environment. Typically, such methods average sample returns to update value and policy estimates. Notably, though, this forces Monte Carlo methods to be incremented only on an episodic basis, and not in an online, step-by-step manner.

Additionally, Monte Carlo methods use the general policy iteration from last chapter, but learn value functions, rather than directly compute them.

5.1 Monte Carlo Prediction

The most intuitive way of learning the value function at a particular state is to simply take the mean of the returns observed after visits to said state. There are two variants though.

First-visit MC method: estimate $v_{\pi}(s)$ as the average of returns following first visits to $s$ (excludes later visits in the same episode).
Every-visit MC method: estimate $v_{\pi}(s)$ as the average of returns following all visits to $s$ .

Both methods converge to $v_{\pi}(s)$ as the number of visits to $s$ approaches infinity.

tip

The computational cost estimating the value of a single state is independent of the size of the state space, making Monte Carlo methods a much more attractive option when looking to estimate the value of only a subset of states.

5.2 Monte Carlo Estimation of Action Value

Without a model of the environment, it's typically more useful to estimate action values $Q(\cdot)$ instead of state values $V(\cdot)$ . This is because an estimate of the value of each action naturally suggests a policy, while with knowledge of state transition probabilities this was not necessary due to the ability to compute over all possible states.

This, however, has some issues for Monte Carlo methods—in particular, when following a deterministic policy $\pi$ , returns will be observed for only one action from each state. Thus, Monte Carlo should maintain exploration to enable estimating the value of all actions.

One solution for this dilemma is exploring starts, in which each episode start in a randomly chosen state-action pair (where every pair has a nonzero probability of selection). This guarantees the visitation of all state-action pairs as the number of episodes approaches infinity.

However, exploring starts is limited, especially when actual interaction is constrained to start only from certain states and actions. A common alternative to consider is simulating only $\epsilon$ -soft policies, i.e. policies where every action has a probability of at least $\epsilon$ of being selected in each state.

5.3 Monte Carlo Control

We can apply Monte Carlo sampling methods to GPI; however, we require two (unlikely) assumptions to guarantee convergence.

Infinite number of episodes
Exploring starts

For the first assumption, we may note that a finite bound on the number of episodes guarantees convergence to an arbitrary degree of approximation of the true value function. However, this is likely to require far too many episodes for practical purposes. Instead, as with value iteration from the last chapter, we simply "give up" in trying to complete policy evaluation and only move the estimate toward the exact value function each evaluation step.

5.4 Monte Carlo Control without Exploring Starts

Exploring starts, as aforementioned, is often not possible. We'd like some other method of guaranteeing all actions are selected infinitely often. There are two approaches.

On-policy: evaluate and improve the policy used to make decisions
Off-policy: evaluate and improve a policy distinct from that used to generate the data

For on-policy approaches, it's necessary to use $\epsilon$ -soft policies, commonly $\epsilon$ -greedy policies. In the policy improvement step, rather than make the policy greedy, we make the policy $\epsilon$ -greedy with respect to the greedy action. Otherwise, GPI is kept the same. Notably, this will result in GPI eventually converging to the optimal $\epsilon$ -greedy policy, not the optimal policy overall!

We can follow a very similar proof as in the last chapter to show that this methodology improves at every step and eventually converges for $\epsilon$ -greedy and, in general, $\epsilon$ -soft policies to the optimal $\epsilon$ -greedy and $\epsilon$ -soft policy, respectively.

5.5 Off-Policy Prediction via Importance Sampling

In off-policy learning, we seek to learn a target policy (that should eventually become optimal) by using a behavior policy (that is more explorative than optimal) to generate data that is "off" the target policy.

info

Due to the inherent drawbacks of sampling data from a different policy, off-policy methods typically exhibit greater variance and thus are slower to converge. However, they are simultaneously much more general and powerful.

First, let us suppose the prediction problem: given samples from following a fixed policy $b$ , how can we estimate $v_{\pi}$ or $q_{\pi}$ for a fixed policy $\pi$ such that $\pi\neq b$ ?

Notably, we must guarantee that every action that may be taken under $\pi$ is also taken, with nonzero probability, under $b$ . That is, we require

\pi(a\mid s)>0\implies b(a\mid s)>0

This is known as the assumption of coverage. It follows, naturally, that $b$ must be stochastic in states where it differs from $\pi$ . Meanwhile, $\pi$ has no such constraints, and may be deterministic (and this is often the case, as we usually desire to eventually improve $\pi$ to an optimal policy in off-policy algorithms).

Subsequently, we may apply a technique known as importance sampling, which allows estimating the expected values under one distribution using samples from another distribution. This may be applied to off-policy learning by weighting returns according to their relative probabilities of occurring under the different policies. For a starting state $S_{t}$ , the probability of the state-action trajectory $A_{t},S_{t+1},A_{t+1},\dots,S_{T}$ occurring under policy $\pi$ is

\prod_{k=t}^{T-1} \pi(A_{k}\mid S_{k})p(S_{k+1}\mid S_{k},A_{k})

and the relative probability of the target policy $\pi$ and behavior policy $b$ is known as the importance-sampling ratio, and may be defined

\rho_{t:T-1}= \prod_{k=t}^{T-1} \frac{\pi(A_{k}\mid S_{k})p(S_{k+1}\mid S_{k},A_{k})}{b(A_{k}\mid S_{k})p(S_{k+1}\mid S_{k},A_{k})}=\prod_{k=t}^{T-1} \frac{\pi(A_{k}\mid S_{k})}{b(A_{k}\mid S_{k})}

where, notably, the state transition probabilities of the environment's MDP conveniently cancel out!

We can use this ratio to then estimate $v_{\pi}$ , as now

v_{\pi}(s)=\mathbb{E}[\rho_{t:T-1}\cdot G_{t}\mid S_{t}=s]

Subsequently, if we let $\mathcal{T}(s)$ denote the set of all time steps in which state $s$ is visited (across all episodes, where we number time steps such that they increase across episode boundaries) and let $T(t)$ denote the final time step for the episode containing $t$ , we can write the Monte Carlo estimate of our value function.

V(s)= \frac{\sum_{t\in \mathcal{T}(s)}\rho_{t:T(t)-1}G_{t}}{\mathcal{T}(s)}

this average is known as ordinary importance sampling. An alternative is

V(s) = \frac{\sum_{t\in \mathcal{T}(s)}\rho_{t:T(t)-1}G_{t}}{\sum_{t\in \mathcal{T}(s)}\rho_{t:T(t)-1}G_{t}}

which is known as weighted importance sampling, as it takes a weighted average of the returns.

There are important differences between the two averages.

Unbiased, but unbounded variance
Biased, but much lower variance.

Consider, for instance, the estimates after observing a single return in a first-visit method.

Ordinary importance sampling, in expectation, is $v_{\pi}(s)$ , but it can have extreme results since slight deviations in the policy probabilities at each time step can quickly explode.
Weighted importance sampling, in expectation, is $v_{b}(s)$ , but its variance is limited by the "weighting" in the denominator.

Notably, the bias and variance of weighted importance sampling converge to zero. In practice, weighted importance sampling performs much better because it demonstrates dramatically lower variance at the cost of a relatively manageable increase in bias.

Additionally, the every-visit methods for both ordinary and weighted importance sampling are biased, though the bias approaches zero as the number of samples increases. In practice,every-visit methods are preferred because of the reduced overhead (no need to track visited states) and they are easier to extend to approximations.

5.6 Incremental Implementation

Monte Carlo prediction methods can be implemented incrementally, on an episode-by-episode basis, much like how previous methods supported incremental updates. We'll discuss the incremental implementations for off-policy methods.

For ordinary importance sampling, we can simply use the methods derived in Chapter 2, but replace rewards with returns. For weighted importance sampling, we have to update a weighted average; this can be done as follows

\begin{align*} V_{n+1} &= V_{n}+\frac{W_{n}}{C_{n}}[G_{n}-V_{n}] \\ C_{n+1} &= C_{n} + W_{n+1} \end{align*}

where $W_{i}$ denotes the weight of return $i$ , and $C_{i}$ denotes the cumulative sum of the weights for the first $i$ returns.

5.7 Off-Policy Monte Carlo Control

The naive off-policy Monte Carlo GPI algorithm updates $W\leftarrow W \frac{1}{b(A_{t}\mid S_{t})}$ every iteration, looping backwards from the end of each episode. Unfortunately, this disproportionately places weight on the tails of episodes. If nongreedy actions are common in the behavior policy, learning will be slow, especially for states that appear early in long episodes. There are some solutions to this problem, discussed in the next sections.

5.8 Discounting-aware Importance Sampling

Recall the idea of discounting; we can apply the same idea to importance sampling.

Consider ordinary importance sampling. The return for time 0 has an importance sampling ratio of

\prod_{i=0}^{T} \frac{\pi(A_{t}\mid S_{t})}{b(A_{t}\mid S_{t})}

However, we mostly care about the immediate action

\frac{\pi(A_{0}\mid S_{0})}{b(A_{0}\mid S_{0})}

as, after the first action, the return is already determined. Therefore, the other factors add a lot of variance.

One may think of discounting, in this situation, as a "probability of termination" $1-\gamma$ applied to each step. This is known as the degree of partial termination, where the return partly terminates in one step with degree $1-\gamma$ , and in general partly terminates in $n$ steps with degree $(1-\gamma)\gamma^{n-1}$ . Such partial returns are known as flat partial returns, i.e.

\bar{G}_{t:h}=R_{t+1}+\dots +R_{h}

where "flat" refers to the absence of discounting and "partial" refers to the premature termination at the horizon $h$ instead of the episode end $T$ .

Then, we may actually rewrite the full, discounted return $G_{t}$ in terms of the flat partial returns

\begin{align*} G_{t} &= R_{t+1}+\gamma R_{t+2}+\dots +\gamma^{T-t-1}R_{T} \\ &= (1-\gamma)R_{t+1} \\ &\quad \ + (1-\gamma)\gamma(R_{t+1}+R_{t+2}) \\ &\quad \ + (1-\gamma)\gamma^{2}(R_{t+1}+R_{t+2}+R_{t+3}) \\ &\quad \ \ \ \vdots \\ &= (1-\gamma)\sum_{h=t+1}^{T-1} \gamma^{h-t-1}\bar{G}_{t:h}+\gamma^{T-t-1}\bar{G}_{t:T} \end{align*}

which allows us to then scale each flat partial return by a similarly truncated importance sampling ratio. Ordinary importance sampling thus becomes

V(s) = \frac{\sum_{t\in \mathcal{T}(s)}\left( (1-\gamma)\sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1}\rho_{t:h-1}\bar{G}_{t:h}+\gamma^{T(t)-t-1}\rho_{t:T(t)-1}\bar{G}_{t:T(t)} \right)}{\lvert \mathcal{T}(s) \rvert }

and weighted-importance sampling is similar, except with we change the denominator to reflect the change in weights (and preserve the weighted average).

Such estimators are known as discounting-aware importance sampling estimators.

5.9 Per-decision Importance Sampling

We note that the importance sampling ratio may be written as a sum

\begin{align*} \rho_{t:T-1}G_{t} &= p_{t:T-1}(R_{t+1}+\gamma R_{t+2}+\dots +\gamma^{T-t-1}\rho_{t:T-1}R_{T}) \\ &= \rho_{t:T-1}R_{t+1}+\gamma \rho_{t:T-1}R_{t+2}+\dots +\gamma^{T-t-1}\rho_{t:T-1}R_{T} \end{align*}

and that each term of the above expression may be rewritten, e.g. the first term is

\rho_{t:T-1}R_{t+1}= R_{t+1} \prod_{t'=t}^{T-1} \frac{\pi(A_{t}\mid S_{t})}{b(A_{t}\mid S_{t})}

Critically, only the first term of the product, $\frac{\pi(A_{t}\mid S_{t})}{b(A_{t}\mid S_{t})}$ , is related to the reward term $R_{t+1}$ ; all other terms reflect importance sampling ratios for events after the reward for this time step. Moreover, in expectation,

\mathbb{E}_{b}\left[ \frac{\pi(A_{k}\mid S_{k})}{b(A_{k}\mid S_{k})} \right] = \sum_{a}b(a\mid S_{k}) \frac{\pi(a\mid S_{k})}{b(a\mid S_{k})}=\sum_{a}\pi(a\mid S_{k})=1

which means that, in expectation,

\mathbb{E}[\rho_{t:T-1}R_{t+1}]=\mathbb{E}[\rho_{t:t}R_{t+1}]

and in general, for the rest of the terms of the original expression,

\mathbb{E}[\rho_{t:T-1}R_{t+k}]=\mathbb{E}[\rho_{t:t+k-1}R_{t+k}]

Therefore, we may rewrite the expectation of our original $\mathbb{E}[\rho_{t:T-1}G_{t}]$ as $\mathbb{E}[\tilde{G}_{t}]$ , where

\tilde{G}_{t}=\rho_{t:t}R_{t+1}+\gamma \rho_{t:t+1}R_{t+2}+\dots +\gamma^{T-t-1}\rho_{t:T-1}R_{T}

this is known as per-decision importance sampling, and it suffices to simply replace $G_{t}$ with $\tilde{G}_{t}$ in the ordinary importance sampling formula. (Notably, it is unclear whether this may be applied to weighted importance sampling).