
Lecture 5: Policy Gradients

REINFORCE: Basic Policy Gradients

Recall the objective of RL

$$\theta^{*} = \underset{\theta}{\arg\max}\ \mathbb{E}_{\tau\sim p_{\theta}(\tau)}\left[ \sum_{t} r(\mathbf{s}_{t},\mathbf{a}_{t}) \right]$$

with sequences $\tau$ sampled from the trajectory distribution

$$p_{\theta}(\mathbf{s}_{1},\mathbf{a}_{1},\dots ,\mathbf{s}_{H},\mathbf{a}_{H}) = p(\mathbf{s}_{1})\prod_{t=1}^{H} \pi_{\theta}(\mathbf{a}_{t}\mid \mathbf{s}_{t})p(\mathbf{s}_{t+1}\mid \mathbf{s}_{t},\mathbf{a}_{t})$$

With policy gradient methods, we evaluate the policy by computing an estimate of our average reward, i.e.

$$J(\theta) = \mathbb{E}_{\tau\sim p_{\theta}(\tau)}\left[ \sum_{t} r(\mathbf{s}_{t},\mathbf{a}_{t}) \right] \approx \frac{1}{N}\sum_{i}\sum_{t}r(\mathbf{s}_{t}^{(i)},\mathbf{a}_{t}^{(i)})$$

where $J(\theta)$ is the value being maximized in the RL objective,

$$\theta^{*} = \underset{\theta}{\arg\max}\ \underbrace{\mathbb{E}_{\tau\sim p_{\theta}(\tau)}\left[ \sum_{t} r(\mathbf{s}_{t},\mathbf{a}_{t}) \right]}_{J(\theta)}$$

and improve it by computing a gradient on the parameters

$$\theta\leftarrow\theta+\alpha \nabla_{\theta}J(\theta)$$

One might think that we're done at this point: we can go train a neural network and compute gradients $\nabla_{\theta}J(\theta)$ to learn the correct policy. However, whatever machine learning library we're using doesn't know about the external environment that produces rewards in response to our policy's decisions. Differentiating the sample estimate directly would therefore just give a gradient of $0$, since the program has no way of knowing that the sampled states, actions, and rewards are influenced by the policy $\pi_{\theta}$. This is evidenced most clearly by the fact that our estimate of $J(\theta)$, i.e. $\frac{1}{N}\sum_{i}\sum_{t}r(\mathbf{s}_{t}^{(i)},\mathbf{a}_{t}^{(i)})$, doesn't even include $\theta$ in the expression!
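To make the last point concrete, here is a minimal Python sketch of the sample estimate of $J(\theta)$. The function name `estimate_return` and the representation of each trajectory as a plain list of rewards are assumptions made for illustration. Note that $\theta$ appears nowhere in the computation, so an automatic differentiation library has nothing to trace.

```python
def estimate_return(reward_sequences):
    """Monte Carlo estimate of J(theta): mean total reward over sampled trajectories.

    reward_sequences: list of N lists, where the i-th list holds the rewards
    r(s_t, a_t) observed along trajectory i (a hypothetical data layout).
    """
    total_rewards = [sum(rewards) for rewards in reward_sequences]
    return sum(total_rewards) / len(total_rewards)


# Two sampled trajectories with total rewards 3.0 and 4.0.
sampled_rewards = [[1.0, 2.0], [0.5, 0.5, 3.0]]
j_hat = estimate_return(sampled_rewards)
```

The inputs are ordinary numbers recorded from the environment, which is why a different expression for the gradient is needed.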

Direct Policy Differentiation

Direct policy differentiation rewrites $\nabla_{\theta}J(\theta)$ in a form that can be estimated from sampled trajectories, without differentiating through the environment.

Let $r(\tau)=\sum_{t=1}^{H}r(\mathbf{s}_{t},\mathbf{a}_{t})$. Then, we can write that

$$J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)}[r(\tau)]=\int p_{\theta}(\tau)r(\tau) \,\mathrm{d}\tau$$

Differentiating with respect to $\theta$ produces

$$\nabla_{\theta}J(\theta)=\int \nabla_{\theta}p_{\theta}(\tau)r(\tau) \,\mathrm{d}\tau$$

where the gradient may be moved inside the integral because differentiation is linear (assuming enough regularity to exchange the gradient and the integral). Now, note the following identity.

$$p_{\theta}(\tau)\nabla_{\theta}\log p_{\theta}(\tau)=p_{\theta}(\tau) \frac{\nabla_{\theta}p_{\theta}(\tau)}{p_{\theta}(\tau)}= \nabla_{\theta}p_{\theta}(\tau)$$

This allows us to substitute

$$\nabla_{\theta}J(\theta)=\int p_{\theta}(\tau)\nabla_{\theta}\log p_{\theta}(\tau)r(\tau) \,\mathrm{d}\tau$$

This is important because it allows us to rewrite this as an expectation!

$$\nabla_{\theta}J(\theta)=\int (p_{\theta}(\tau))(\nabla_{\theta}\log p_{\theta}(\tau)r(\tau)) \,\mathrm{d}\tau = \mathbb{E}_{\tau \sim p_{\theta}(\tau)}[\nabla_{\theta}\log p_{\theta}(\tau)r(\tau)]$$

This conclusion is sometimes called the policy gradient theorem, and is very important—make sure you understand this!!
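A small numerical check can make this result easier to trust. The sketch below assumes a toy one-step problem with two actions and a policy $\pi_{\theta}(a=1)=\mathrm{sigmoid}(\theta)$; the problem, the reward values, and all function names are illustrative assumptions. It compares the expectation of $\nabla_{\theta}\log \pi_{\theta}(a)\,r(a)$, computed exactly by summing over both actions, against a finite-difference derivative of $J(\theta)$.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def expected_reward(theta, rewards):
    """J(theta) for a one-step problem with pi_theta(a=1) = sigmoid(theta)."""
    p1 = sigmoid(theta)
    return (1.0 - p1) * rewards[0] + p1 * rewards[1]


def score_function_gradient(theta, rewards):
    """E[grad_theta log pi_theta(a) * r(a)], summed exactly over both actions."""
    p1 = sigmoid(theta)
    probs = [1.0 - p1, p1]
    # For a sigmoid policy, d/dtheta log pi_theta(a) = a - sigmoid(theta).
    scores = [0.0 - p1, 1.0 - p1]
    return sum(p * s * r for p, s, r in zip(probs, scores, rewards))


theta, rewards = 0.3, [1.0, 4.0]
eps = 1e-6
# Central finite difference of J(theta) as an independent reference.
finite_difference = (expected_reward(theta + eps, rewards)
                     - expected_reward(theta - eps, rewards)) / (2 * eps)
exact = score_function_gradient(theta, rewards)
```

The two values agree up to finite-difference error, even though `score_function_gradient` never differentiates the reward.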

There's still a little bit of work to be done to finish this off, though. It's not immediately obvious how to compute $\nabla_{\theta}\log p_{\theta}(\tau)$. First, note that

$$p_{\theta}(\tau)=p_{\theta}(\mathbf{s}_{1},\mathbf{a}_{1},\dots ,\mathbf{s}_{H},\mathbf{a}_{H}) = p(\mathbf{s}_{1})\prod_{t=1}^{H} \pi_{\theta}(\mathbf{a}_{t}\mid \mathbf{s}_{t})p(\mathbf{s}_{t+1}\mid \mathbf{s}_{t},\mathbf{a}_{t})$$

Thus,

$$\begin{align*}
\log p_{\theta}(\tau)&=\log p(\mathbf{s}_{1})+\sum_{t=1}^{H} \left[\log \pi_{\theta}(\mathbf{a}_{t}\mid \mathbf{s}_{t}) + \log p(\mathbf{s}_{t+1}\mid \mathbf{s}_{t},\mathbf{a}_{t})\right] \\
\nabla_{\theta}\log p_{\theta}(\tau) &= \cancel{ \nabla_{\theta}\log p(\mathbf{s}_{1}) }+\sum_{t=1}^{H} \left[\nabla_{\theta}\log \pi_{\theta}(\mathbf{a}_{t}\mid \mathbf{s}_{t}) + \cancel{ \nabla_{\theta}\log p(\mathbf{s}_{t+1}\mid \mathbf{s}_{t},\mathbf{a}_{t}) }\right] \\
&= \sum_{t=1}^{H} \nabla_{\theta}\log \pi_{\theta}(\mathbf{a}_{t}\mid \mathbf{s}_{t}) \\
\nabla_{\theta}J(\theta) &= \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[ \left( \sum_{t=1}^{H} \nabla_{\theta}\log \pi_{\theta}(\mathbf{a}_{t}\mid \mathbf{s}_{t}) \right)\left( \sum_{t=1}^{H} r(\mathbf{s}_{t},\mathbf{a}_{t}) \right) \right]
\end{align*}$$

where the cancelled terms vanish because neither the initial state distribution nor the dynamics depend on $\theta$.

Consequently, to evaluate the gradient, we can directly compute it from the policy evaluations!

$$\nabla_{\theta}J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \left[ \left( \sum_{t=1}^{H} \nabla_{\theta}\log \pi_{\theta}(\mathbf{a}_{t}^{(i)}\mid \mathbf{s}_{t}^{(i)}) \right)\left( \sum_{t=1}^{H} r(\mathbf{s}_{t}^{(i)},\mathbf{a}_{t}^{(i)}) \right) \right]$$

This algorithm, in fact, is known as REINFORCE:

  1. Sample $\tau$ (run policy)
  2. Estimate $\nabla_{\theta}J(\theta)$ by direct policy differentiation
  3. Update $\theta\leftarrow\theta+\alpha \nabla_{\theta}J(\theta)$
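The three steps above can be sketched in plain Python. Everything about the setup here is an assumption made for illustration: a stateless toy environment where action 1 earns reward 1 and action 0 earns reward 0, a single-parameter policy $\pi_{\theta}(a=1)=\mathrm{sigmoid}(\theta)$, and the hyperparameter values. It is a sketch of the idea, not a reference implementation.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_trajectory(theta, horizon, rng):
    """Run the policy for `horizon` steps; return the actions and rewards."""
    p1 = sigmoid(theta)
    actions = [1 if rng.random() < p1 else 0 for _ in range(horizon)]
    rewards = [float(a) for a in actions]  # toy reward: 1 for action 1, else 0
    return actions, rewards


def reinforce(num_iterations=300, num_trajectories=16, horizon=3,
              learning_rate=0.1, seed=0):
    rng = random.Random(seed)
    theta = 0.0  # single policy parameter: pi_theta(a=1) = sigmoid(theta)
    for _ in range(num_iterations):
        gradient = 0.0
        for _ in range(num_trajectories):
            # Step 1: sample a trajectory by running the current policy.
            actions, rewards = sample_trajectory(theta, horizon, rng)
            # Step 2: (sum_t grad log pi(a_t)) * (sum_t r_t), using
            # d/dtheta log pi_theta(a) = a - sigmoid(theta).
            score_sum = sum(a - sigmoid(theta) for a in actions)
            gradient += score_sum * sum(rewards)
        gradient /= num_trajectories
        # Step 3: gradient ascent on the estimated objective.
        theta += learning_rate * gradient
    return theta


final_theta = reinforce()
probability_of_rewarded_action = sigmoid(final_theta)
```

After training, the policy should place most of its probability on the rewarded action. The gradient of $\log \pi_{\theta}$ is written by hand here only because the policy has a single parameter.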

Maximum Likelihood Estimation

We can compare this policy gradient with the gradient for imitation learning computed by maximum likelihood estimation.

$$\begin{align*}
\text{policy gradient:} && \nabla_{\theta}J(\theta) &\approx \frac{1}{N}\sum_{i=1}^{N} \left[ \left( \sum_{t=1}^{H} \nabla_{\theta}\log \pi_{\theta}(\mathbf{a}_{t}^{(i)}\mid \mathbf{s}_{t}^{(i)}) \right)\left( \sum_{t=1}^{H} r(\mathbf{s}_{t}^{(i)},\mathbf{a}_{t}^{(i)}) \right) \right] \\
\text{maximum likelihood:} && \nabla_{\theta}J_{\text{ML}}(\theta) &\approx \frac{1}{N}\sum_{i=1}^{N} \left( \sum_{t=1}^{H} \nabla_{\theta}\log \pi_{\theta}(\mathbf{a}_{t}^{(i)}\mid \mathbf{s}_{t}^{(i)}) \right)
\end{align*}$$

So, the policy gradient is identical except for the reward term; thus, the policy gradient doesn't just try to imitate the sample trajectories, but it also favors positive rewards and penalizes negative rewards. Essentially, you can think of policy gradient as formalizing trial and error!

Partial Observability?

Does policy gradient still function for RL problems with partial observability, i.e. $\mathbf{o}_{t}\neq \mathbf{s}_{t}$? The answer is yes: it functions pretty much identically, just with $\log \pi_{\theta}(\mathbf{a}_{t}^{(i)}\mid \mathbf{o}_{t}^{(i)})$ instead, because the Markov property was never used in the derivation.

Limitations

Policy gradient methods, however, have some limitations. For example, consider applying REINFORCE to an RL problem where a model plays from a preset chess position. The reward is $+1$ for winning and $-1$ for losing.

Naturally, the goal is for the model to favor good moves and penalize bad moves; for this to happen, we'd like $\nabla_{\theta}J(\theta)$ to produce positive multipliers for good moves and negative multipliers for bad moves. But this is not necessarily the case, since every move in a game shares that game's final reward: a good move in a game that is eventually lost still receives a negative multiplier.

Now, while these issues do average out, it takes many, many samples! The key problem is that this policy gradient algorithm has high variance. How can we reduce it?

Variance Reduction

Baselines

We can essentially demean the trajectories' rewards by computing

$$b= \frac{1}{N}\sum_{i=1}^{N} r(\tau^{(i)})$$

and modifying $\nabla_{\theta}J(\theta)$ to

$$\nabla_{\theta}J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_{\theta}\log p_{\theta}(\tau^{(i)})[r(\tau^{(i)})-b] = \frac{1}{N}\sum_{i=1}^{N} [\nabla_{\theta}\log p_{\theta}(\tau^{(i)})r(\tau^{(i)})-\nabla_{\theta}\log p_{\theta}(\tau^{(i)})b]$$

Notably, the estimator of $\nabla_{\theta}J(\theta)$ remains unbiased, because the subtracted term has zero expectation when $b$ is treated as a constant:

$$\begin{align*}
\mathbb{E}[\nabla_{\theta}\log p_{\theta}(\tau)b] &= \int p_{\theta}(\tau)\nabla_{\theta}\log p_{\theta}(\tau)b \,\mathrm{d}\tau \\
&= \int \nabla_{\theta}p_{\theta}(\tau)b \,\mathrm{d}\tau && (1) \\
&= b\nabla_{\theta}\int p_{\theta}(\tau) \,\mathrm{d}\tau && (2) \\
&= b\nabla_{\theta}(1) && (3) \\
&= 0
\end{align*}$$

where $(1)$ derives from the identity we showed earlier, $(2)$ holds because the constant $b$ and the gradient can both be pulled outside the integral, and $(3)$ derives from the properties of a probability distribution, i.e. $\int p(x) \,\mathrm{d}x = 1$ for any probability distribution $p$ over $x$.

Although the baseline leaves the expectation unchanged, suitable values of $b$ reduce the variance! Letting $b$ be the average reward (as defined above) typically reduces variance and generally performs well. A better, variance-optimal baseline exists, but it is not used in practice due to its complexity.

The intuition here is that, even if all the trajectories' rewards are positive, we still induce the model to more strongly favor the higher rewards, rather than favor all positive rewards.
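This intuition can be checked exactly on a tiny example. The sketch below assumes a one-step problem with two actions, a policy $\pi_{\theta}(a=1)=p_1$ parameterized by a single logit, and two positive rewards; the setup and names are illustrative assumptions, and the baseline used is the exact expected reward rather than a sample average. It computes the mean and variance of the single-sample estimator with and without the baseline.

```python
def estimator_mean_and_variance(p1, rewards, baseline):
    """Exact mean and variance of g(a) = grad log pi(a) * (r(a) - baseline)
    for a one-step, two-action problem with pi(a=1) = p1 (logit parameter)."""
    probs = [1.0 - p1, p1]
    scores = [0.0 - p1, 1.0 - p1]  # d/dtheta log pi_theta(a) = a - p1
    samples = [s * (r - baseline) for s, r in zip(scores, rewards)]
    mean = sum(p * g for p, g in zip(probs, samples))
    variance = sum(p * (g - mean) ** 2 for p, g in zip(probs, samples))
    return mean, variance


p1, rewards = 0.5, [10.0, 11.0]  # both rewards are positive
average_reward = (1.0 - p1) * rewards[0] + p1 * rewards[1]

mean_plain, var_plain = estimator_mean_and_variance(p1, rewards, baseline=0.0)
mean_base, var_base = estimator_mean_and_variance(p1, rewards, baseline=average_reward)
```

Both estimators have the same mean, but the one with the baseline has a much smaller variance. In this toy case the baseline happens to remove the variance entirely, which should not be expected in larger problems.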

Causality

Causality means that the policy at time $t'$ cannot affect the reward at time $t$ when $t<t'$. At the moment, our gradient calculation does not consider causality.

$$\nabla_{\theta}J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \left[ \left( \sum_{t=1}^{H} \nabla_{\theta}\log \pi_{\theta}(\mathbf{a}_{t}^{(i)}\mid \mathbf{s}_{t}^{(i)}) \right)\left( \sum_{t=1}^{H} r(\mathbf{s}_{t}^{(i)},\mathbf{a}_{t}^{(i)}) - b \right) \right]$$

Every gradient term $\nabla_{\theta}\log \pi_{\theta}(\mathbf{a}_{t}^{(i)}\mid \mathbf{s}_{t}^{(i)})$ is multiplied by rewards from all time steps, including those before $t$. Instead, by including causality, we produce

$$\nabla_{\theta}J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{H} \nabla_{\theta}\log \pi_{\theta}(\mathbf{a}_{t}^{(i)}\mid \mathbf{s}_{t}^{(i)}) \left( \sum_{t'=t}^{H} r(\mathbf{s}_{t'}^{(i)},\mathbf{a}_{t'}^{(i)}) - b_{t} \right)$$

so that the gradient term at time step $t$ is only influenced by rewards from time steps $t'\in[t,H]$, and the baseline $b_{t}$ may depend on the time step.

Note that this reduces variance for a rather trivial reason: each $\nabla_{\theta}\log \pi_{\theta}$ term is multiplied by a sum over fewer rewards, which is typically a smaller number. Regardless, this modification is typically effective.
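The inner sum over $t'\in[t,H]$ is often called the reward-to-go. As a minimal sketch (the function name `rewards_to_go` and the list-based representation are assumptions for illustration), it can be computed for every $t$ in a single backward pass:

```python
def rewards_to_go(rewards):
    """Return [sum_{t'=t}^{H} r_{t'} for each t] via one backward pass."""
    result = [0.0] * len(rewards)
    running_sum = 0.0
    for t in reversed(range(len(rewards))):
        running_sum += rewards[t]
        result[t] = running_sum
    return result


example = rewards_to_go([1.0, 2.0, 3.0])  # [6.0, 5.0, 3.0]
```

The first entry equals the total trajectory reward, and each later entry drops the rewards that were already collected before that time step.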

Practical Implementation

Automatic Differentiation

We can compute the policy gradients $\nabla_{\theta}\log \pi_{\theta}(\mathbf{a}_{t}^{(i)}\mid \mathbf{s}_{t}^{(i)})$ with automatic differentiation. The key idea is that these gradients are the same as the gradients for a supervised neural network; we just have to design the right graph such that its overall gradient is the policy gradient $\nabla_{\theta}J(\theta)$.

We have to essentially set the loss function to be a weighted maximum likelihood loss function (recall the [[#Maximum Likelihood Estimation|comparison]] between maximum likelihood and policy gradient), weighted by reward.

$$\tilde{J}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{H} \log \pi_{\theta}(\mathbf{a}_{t}^{(i)}\mid \mathbf{s}_{t}^{(i)})\hat{Q}_{t}^{(i)}$$

where $\hat{Q}_{t}^{(i)}=\sum_{t'=t}^{H}r(\mathbf{s}_{t'}^{(i)},\mathbf{a}_{t'}^{(i)})$ is the reward-to-go from time step $t$, treated as a constant when differentiating. As with maximum likelihood, $-\log \pi_{\theta}$ is the cross-entropy loss for discrete problems and, up to scaling and an additive constant, the squared error for continuous (Gaussian) problems.
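For the continuous (Gaussian) case, a short worked equation may help. Assuming, purely for illustration, a Gaussian policy with mean $\mu_{\theta}(\mathbf{s}_{t})$ and a fixed isotropic covariance $\sigma^{2}I$ over a $d$-dimensional action, the log-probability is a scaled squared error plus a term that does not depend on $\theta$:

$$\log \pi_{\theta}(\mathbf{a}_{t}\mid \mathbf{s}_{t}) = -\frac{1}{2\sigma^{2}}\left\lVert \mathbf{a}_{t}-\mu_{\theta}(\mathbf{s}_{t}) \right\rVert^{2} - \frac{d}{2}\log(2\pi\sigma^{2})$$

Under that assumption, the gradient of $-\log \pi_{\theta}$ is proportional to the gradient of the squared error between the sampled action and the predicted mean.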

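The pseudocode for maximum likelihood (discrete) for supervised learning would look like

import tensorflow as tf
# Assumes `policy` is a tf.keras.Model mapping states to action logits and `actions` holds one-hot labels.
with tf.GradientTape() as tape:
    logits = policy(states)
    negative_likelihoods = tf.nn.softmax_cross_entropy_with_logits(labels=actions, logits=logits)  # -log pi(a | s)
    loss = tf.reduce_mean(negative_likelihoods)
gradients = tape.gradient(loss, policy.trainable_variables)

While the pseudocode for policy gradient (discrete) for reinforcement learning would look like

import tensorflow as tf
# Same assumptions as for maximum likelihood; `q_values` holds one precomputed reward weight per sample.
with tf.GradientTape() as tape:
    logits = policy(states)
    negative_likelihoods = tf.nn.softmax_cross_entropy_with_logits(labels=actions, logits=logits)
    # Treat the weights as constants so that no gradient flows through them.
    weighted_negative_likelihoods = tf.multiply(negative_likelihoods, tf.stop_gradient(q_values))
    loss = tf.reduce_mean(weighted_negative_likelihoods)  # minimizing this loss maximizes the weighted likelihood
gradients = tape.gradient(loss, policy.trainable_variables)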

And that's it!
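As a framework-free sanity check of the construction in this section, the sketch below verifies numerically that differentiating the reward-weighted log-likelihood gives the same result as summing $\nabla_{\theta}\log \pi_{\theta}$ weighted by $\hat{Q}$. It assumes a stateless softmax policy whose parameters are its three logits, and the sampled actions and weights are made-up data; all names are hypothetical.

```python
import math


def softmax(logits):
    shifted = [x - max(logits) for x in logits]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def surrogate(logits, actions, q_values):
    """Mean over samples of log pi_theta(a) * Q_hat, with theta = logits."""
    log_probs = [math.log(p) for p in softmax(logits)]
    return sum(log_probs[a] * q for a, q in zip(actions, q_values)) / len(actions)


def policy_gradient(logits, actions, q_values):
    """Mean over samples of grad log pi_theta(a) * Q_hat, using
    d/dtheta_k log softmax(theta)[a] = 1[k == a] - softmax(theta)[k]."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, q in zip(actions, q_values):
        for k in range(len(logits)):
            grad[k] += ((1.0 if k == a else 0.0) - probs[k]) * q
    return [g / len(actions) for g in grad]


def finite_difference(logits, actions, q_values, eps=1e-6):
    """Central finite-difference gradient of the surrogate objective."""
    grad = []
    for k in range(len(logits)):
        up = list(logits)
        up[k] += eps
        down = list(logits)
        down[k] -= eps
        grad.append((surrogate(up, actions, q_values)
                     - surrogate(down, actions, q_values)) / (2 * eps))
    return grad


logits = [0.2, -0.5, 1.0]         # stateless policy over three actions
actions = [0, 2, 2, 1]            # sampled actions (made-up data)
q_values = [1.0, 3.0, -2.0, 0.5]  # matching reward weights (made-up data)
analytic = policy_gradient(logits, actions, q_values)
numeric = finite_difference(logits, actions, q_values)
```

The two gradients agree up to finite-difference error, which is the property an automatic differentiation library relies on when it differentiates the weighted loss.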

Reminders