So, why policy gradients? They converge, unlike Q-function methods, and on-policy methods allow Monte Carlo advantage estimators (GAE with λ=1), which provide an unbiased estimate. Thus, policy gradients are stable, reliable methods when sample efficiency is not a concern (i.e. samples are cheap to generate).
Let's consider our usual policy gradient algorithm with GAE.
As it stands, it is too inefficient for practical purposes:
- Each iteration takes only one gradient step.
- Each iteration requires generating new samples.
- The gradient estimate has high variance, and therefore requires large batches and small learning rates (in other words, more iterations).
Ideally, we would take multiple gradient steps for every batch of samples. In other words, repeat steps 5-6 some K times for every execution of steps 1-4. However, this is problematic: after the first gradient step, the trajectories being used were produced by an old policy, not the current πθ.
The primary goal of this lecture is to derive an algorithm that can reuse samples in this way without issue.
Importance Sampling
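Importance sampling is a technique for estimating an expected value under one distribution using samples drawn from a different distribution. This is exactly our problem :)
The objective $J(\theta) = E_{\tau \sim p_\theta(\tau)}[r(\tau)]$ may be written using samples from the trajectory distribution $\bar{p}$ of a different policy as
$$J(\theta) = E_{\tau \sim \bar{p}(\tau)}\left[\frac{p_\theta(\tau)}{\bar{p}(\tau)}\, r(\tau)\right]$$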
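To make the identity concrete, here is a minimal sketch on a three-outcome discrete distribution. The distributions `p` and `q`, the values `f`, and the helper `importance_sampling_estimate` are all made up for illustration; `p` plays the role of the target distribution and `q` the one we actually sample from.

```python
import random

# Target distribution p and sampling distribution q over three outcomes
# (illustrative numbers, not taken from the lecture).
p = [0.1, 0.2, 0.7]
q = [1 / 3, 1 / 3, 1 / 3]
f = [1.0, 2.0, 3.0]  # value of f(x) for each outcome x


def importance_sampling_estimate(num_samples, seed=0):
    """Estimate E_{x~p}[f(x)] using only samples drawn from q."""
    rng = random.Random(seed)
    outcomes = rng.choices(range(len(q)), weights=q, k=num_samples)
    # Each sample is reweighted by the ratio p(x) / q(x).
    return sum(p[x] / q[x] * f[x] for x in outcomes) / num_samples


true_value = sum(px * fx for px, fx in zip(p, f))  # exact E_p[f]
estimate = importance_sampling_estimate(100_000)
print(true_value, estimate)
```

The estimate is unbiased, but its variance depends on how far the ratios `p[x] / q[x]` stray from 1.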
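However, there's one critical issue with this formula. We cannot evaluate $p_\theta(\tau)$ directly, as $p_\theta(\tau) = p(s_1) \prod_{t=1}^{H} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$. In other words, it depends on the state transition probabilities, which we may not know. However, consider that
$$\frac{p_\theta(\tau)}{\bar{p}(\tau)} = \frac{p(s_1) \prod_{t=1}^{H} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)}{p(s_1) \prod_{t=1}^{H} \bar{\pi}(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)} = \frac{\prod_{t=1}^{H} \pi_\theta(a_t \mid s_t)}{\prod_{t=1}^{H} \bar{\pi}(a_t \mid s_t)}$$
so the unknown initial-state and transition probabilities cancel. Writing the objective for new parameters $\theta'$ using samples from $p_\theta$ and differentiating gives
$$\nabla_{\theta'} J(\theta') = E_{\tau \sim p_\theta(\tau)}\left[\frac{\nabla_{\theta'} p_{\theta'}(\tau)}{p_\theta(\tau)}\, r(\tau)\right] = E_{\tau \sim p_\theta(\tau)}\left[\frac{p_{\theta'}(\tau)}{p_\theta(\tau)}\, \nabla_{\theta'} \log p_{\theta'}(\tau)\, r(\tau)\right]$$
where we once again use the identity $p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) = \nabla_\theta p_\theta(\tau)$, applied here with $\theta'$ in place of $\theta$. Note that this is similar to the on-policy policy gradient expression, except with an extra term $\frac{p_{\theta'}(\tau)}{p_\theta(\tau)}$ to account for the difference in policy.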
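The fact that the trajectory ratio depends only on the policies, not on the dynamics, can be checked numerically. The sketch below uses a hypothetical two-state, two-action MDP; every number in it, and the helper names `trajectory_prob` and `policy_ratio`, are assumptions for illustration only.

```python
# Toy MDP with made-up numbers.
initial = [0.6, 0.4]  # p(s_1)
# dynamics[s][a][s_next] = p(s_next | s, a)
dynamics = [[[0.9, 0.1], [0.2, 0.8]],
            [[0.5, 0.5], [0.3, 0.7]]]
# policy[s][a] = pi(a | s)
pi_old = [[0.5, 0.5], [0.5, 0.5]]  # pi_theta
pi_new = [[0.7, 0.3], [0.4, 0.6]]  # pi_theta'


def trajectory_prob(policy, states, actions):
    """p(tau) = p(s_1) * prod_t pi(a_t | s_t) * p(s_{t+1} | s_t, a_t).

    `states` has one more entry than `actions` (the final next state).
    """
    prob = initial[states[0]]
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        prob *= policy[s][a] * dynamics[s][a][s_next]
    return prob


def policy_ratio(states, actions):
    """Product of per-step policy ratios; needs no knowledge of the dynamics."""
    ratio = 1.0
    for s, a in zip(states, actions):  # zip stops before the final next state
        ratio *= pi_new[s][a] / pi_old[s][a]
    return ratio


states, actions = [0, 1, 1, 0], [1, 0, 1]
full_ratio = (trajectory_prob(pi_new, states, actions)
              / trajectory_prob(pi_old, states, actions))
print(full_ratio, policy_ratio(states, actions))
```

Both printed values agree up to floating-point error, since the `initial` and `dynamics` factors appear identically in the numerator and denominator.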
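We can now substitute in the per-step policy ratios and the reward-to-go (recall the causality argument from policy gradient) to produce
$$\nabla_{\theta'} J(\theta') = E_{\tau \sim p_\theta(\tau)}\left[\sum_{t=1}^{H} \nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t) \underbrace{\left(\prod_{t'=1}^{t} \frac{\pi_{\theta'}(a_{t'} \mid s_{t'})}{\pi_\theta(a_{t'} \mid s_{t'})}\right)}_{\text{term A}} \left(\sum_{t'=t}^{H} r(s_{t'}, a_{t'}) \underbrace{\left(\prod_{t''=t}^{t'} \frac{\pi_{\theta'}(a_{t''} \mid s_{t''})}{\pi_\theta(a_{t''} \mid s_{t''})}\right)}_{\text{term B}}\right)\right]$$
where term A implements causality, so that future actions don't affect the current weight, and term B reweights each future reward $r(s_{t'}, a_{t'})$ by how likely the steps leading up to it are under the new policy relative to the old one. These are essentially the "weights" assigned to each reward.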
As an aside, term B is typically dropped. Intuitively, the motivation is the same as for the ideas in the [[Lecture 8#Double Q-Learning|double Q-learning]] section, i.e. maximizing over an old policy is actually more effective. We explore this in more detail next lecture.
We'll focus on term A for the above reason. It turns out that term A can prove very problematic. It is a product over many time steps, and each factor is likely less than 1, since a trajectory produced by the old policy is likely to have greater probability under the old policy than under the new one. The product therefore vanishes at an exponential rate as t increases.
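To get a feel for how quickly such a product shrinks, here is a small sketch. The helper `cumulative_weights` is hypothetical, and the constant per-step ratio of 0.9 over 50 steps is an assumption chosen purely for illustration.

```python
def cumulative_weights(per_step_ratios):
    """Running product of per-step ratios, i.e. term A at each time step t."""
    weights, running = [], 1.0
    for ratio in per_step_ratios:
        running *= ratio
        weights.append(running)
    return weights


# Assumed for illustration: every step has ratio 0.9 on a 50-step trajectory.
weights = cumulative_weights([0.9] * 50)
print(weights[0], weights[9], weights[49])
```

Even with ratios this close to 1, the weight drops to roughly 0.35 after 10 steps and roughly 0.005 after 50, so late time steps contribute almost nothing to the gradient estimate.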
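Now, let's rewrite this expectation in terms of samples, weighting each time step by a ratio of state-action marginals:
$$\nabla_{\theta'} J(\theta') \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{H} \frac{\pi_{\theta'}(s_t^{(i)}, a_t^{(i)})}{\pi_\theta(s_t^{(i)}, a_t^{(i)})}\, \nabla_{\theta'} \log \pi_{\theta'}(a_t^{(i)} \mid s_t^{(i)})\, \hat{A}_t^{(i)}$$
where $\hat{A}_t^{(i)}$ is the advantage estimate and $\pi_\theta(s_t^{(i)}, a_t^{(i)})$ denotes the probability of the state-action pair $(s_t^{(i)}, a_t^{(i)})$ occurring under the policy $\pi_\theta$, as a distribution over the entire state-action space. This is known as a state-action marginal distribution, and the ratio of marginals plays essentially the same role, in expectation, as the previous product of importance sampling ratios.
Subsequently, we can expand the state-action marginal
$$\frac{\pi_{\theta'}(s_t^{(i)}, a_t^{(i)})}{\pi_\theta(s_t^{(i)}, a_t^{(i)})} = \frac{\pi_{\theta'}(s_t^{(i)})}{\pi_\theta(s_t^{(i)})} \cdot \frac{\pi_{\theta'}(a_t^{(i)} \mid s_t^{(i)})}{\pi_\theta(a_t^{(i)} \mid s_t^{(i)})}$$
and then ignore the relative probabilities of state $s_t^{(i)}$ occurring under the different policies, because we assume that $\theta'$ is very similar to $\theta$ and therefore the state distribution will have shifted very little. In other words, we ignore the importance sampling ratios of the previous steps taken on the trajectory to arrive at this state, because we assume the probability of arriving at this state, in expectation, should not change much between policies. Therefore, we only care about the importance sampling ratio of the action probabilities for the current state. We will explore this detail more precisely next lecture. Thus,
$$\nabla_{\theta'} J(\theta') \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{H} \frac{\pi_{\theta'}(a_t^{(i)} \mid s_t^{(i)})}{\pi_\theta(a_t^{(i)} \mid s_t^{(i)})}\, \nabla_{\theta'} \log \pi_{\theta'}(a_t^{(i)} \mid s_t^{(i)})\, \hat{A}_t^{(i)}$$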
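As a minimal sketch of what keeping only the action-probability ratio looks like in practice, the snippet below evaluates a surrogate objective whose gradient with respect to the new parameters has this form. It assumes a tabular softmax policy, a per-sample advantage estimate supplied by the caller, and an average over sampled state-action pairs; the names `softmax`, `action_ratio_surrogate`, and the sample data are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def action_ratio_surrogate(new_logits, old_logits, samples):
    """Mean over samples of pi_new(a|s) / pi_old(a|s) * advantage.

    `samples` is a list of (state, action, advantage) tuples collected under
    the old policy; `*_logits[state]` holds that state's action logits.
    The state-marginal ratio is ignored, as argued in the text.
    """
    total = 0.0
    for state, action, advantage in samples:
        pi_new = softmax(new_logits[state])[action]
        pi_old = softmax(old_logits[state])[action]
        total += pi_new / pi_old * advantage
    return total / len(samples)


# Made-up data: two states, two actions, three sampled steps.
old_logits = [[0.0, 0.0], [0.0, 0.0]]
new_logits = [[1.0, 0.0], [0.0, 0.0]]
samples = [(0, 0, 1.0), (0, 1, -1.0), (1, 0, 0.5)]

on_policy_value = action_ratio_surrogate(old_logits, old_logits, samples)
off_policy_value = action_ratio_surrogate(new_logits, old_logits, samples)
print(on_policy_value, off_policy_value)
```

When the new and old logits coincide, every weight is 1 and the surrogate reduces to the plain mean advantage.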
One issue is that, once the importance weights $\frac{\pi_{\theta'}(a_t^{(i)} \mid s_t^{(i)})}{\pi_\theta(a_t^{(i)} \mid s_t^{(i)})}$ stray too far from 1.0, the estimator is encouraged to over-prioritize good samples and over-penalize bad samples, causing high variance. The figure below shows what might happen.
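We can fix this by clipping the importance weights. Let $w(\tau) = \frac{p_{\theta'}(\tau)}{p_\theta(\tau)}$. We clip by redefining
$$w(\tau) = \max\left\{1 - \epsilon,\ \min\left\{1 + \epsilon,\ \frac{p_{\theta'}(\tau)}{p_\theta(\tau)}\right\}\right\}$$
in other words, $|1 - w(\tau)| \le \epsilon$, which prevents the algorithm from weighting good samples too heavily.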
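The clipping rule is a one-liner in code. The function name `clip_weight` and the value ϵ = 0.2 in the usage lines are illustrative assumptions, not values given in the lecture.

```python
def clip_weight(w, epsilon):
    """Clamp an importance weight to the interval [1 - epsilon, 1 + epsilon]."""
    return max(1.0 - epsilon, min(1.0 + epsilon, w))


# Weights inside the interval pass through; weights outside are clamped.
for raw in (0.5, 0.9, 1.0, 1.1, 1.5):
    print(raw, clip_weight(raw, epsilon=0.2))
```

Once a weight is clamped, the objective no longer changes with the underlying ratio, so that sample stops contributing any gradient.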
There's one more detail to consider, though. Consider a good sample with $w(\tau_g) < 1 + \epsilon$ and a bad sample with $w(\tau_b) \ge 1 + \epsilon$, such that the two samples are similar enough that increasing $w(\tau_g)$ also increases $w(\tau_b)$, and vice versa. Because $w(\tau_b)$ is already clipped, the estimator sees it as constant, while the unclipped $w(\tau_g)$ can still increase. The estimator is therefore only ever encouraged to increase $w(\tau)$ for these samples (and thus increase their probability under the new policy $\theta'$). This may result in the algorithm making a change that substantially increases the probability of $\tau_b$ under the new policy for only a marginal improvement in the probability of $\tau_g$, because from the perspective of the algorithm with importance weight clipping, this is always an improvement. Thus, the clipping is modified a little bit to be
where, essentially, $w(\tau)$ is clipped for good samples, but is not clipped for bad samples. In other words, the penalty from bad samples can grow without bound, but the credit from good samples is always bounded.
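One common way to realize this kind of one-sided clipping is the pessimistic term used in PPO's clipped surrogate, which takes the minimum of the unclipped and clipped weighted advantage. The sketch below assumes a per-sample advantage (positive for good samples, negative for bad ones); the name `clipped_term` and ϵ = 0.2 are illustrative choices, not the lecture's notation.

```python
def clipped_term(w, advantage, epsilon):
    """Pessimistic term: min(w * A, clip(w, 1 - eps, 1 + eps) * A)."""
    clipped_w = max(1.0 - epsilon, min(1.0 + epsilon, w))
    return min(w * advantage, clipped_w * advantage)


# Good sample (A > 0) whose weight grew past 1 + eps: the credit is capped.
print(clipped_term(1.5, 1.0, epsilon=0.2))
# Bad sample (A < 0) whose weight grew past 1 + eps: the penalty is NOT capped.
print(clipped_term(1.5, -1.0, epsilon=0.2))
```

With this form, making a bad sample more likely always shows up as a cost, which removes the incentive described above to trade a large increase in a bad sample's probability for a marginal gain on a good one.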
Proximal Policy Optimization
With these pieces in place, we can now implement proximal policy optimization (PPO), i.e. policy gradient with clipped importance sampling.
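The overall loop structure (sample once, then take several clipped gradient steps on the same batch) can be sketched on a toy problem. Everything below is an illustrative assumption rather than the lecture's algorithm listing: a stateless three-armed bandit with deterministic rewards, a softmax policy with hand-derived gradients, a mean-reward baseline in place of GAE, and hypothetical hyperparameters and function names.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ppo_bandit(rewards, iterations=50, batch_size=64, epochs=5,
               epsilon=0.2, lr=0.1, seed=0):
    """Toy PPO-style loop on a stateless bandit with a softmax policy."""
    rng = random.Random(seed)
    n = len(rewards)
    logits = [0.0] * n
    for _ in range(iterations):
        # Sample one batch with the current (soon to be "old") policy.
        old_probs = softmax(logits)
        actions = rng.choices(range(n), weights=old_probs, k=batch_size)
        baseline = sum(rewards[a] for a in actions) / batch_size
        advantages = [rewards[a] - baseline for a in actions]
        # Take several gradient steps on the same batch.
        for _ in range(epochs):
            new_probs = softmax(logits)
            grad = [0.0] * n
            for a, adv in zip(actions, advantages):
                w = new_probs[a] / old_probs[a]
                # min(w*A, clip(w)*A) is flat in these cases: zero gradient.
                if (adv > 0 and w > 1 + epsilon) or (adv < 0 and w < 1 - epsilon):
                    continue
                # d/d logit_k of (w * A) = w * A * (1[k == a] - pi_new(k))
                for k in range(n):
                    indicator = 1.0 if k == a else 0.0
                    grad[k] += w * adv * (indicator - new_probs[k]) / batch_size
            logits = [z + lr * g for z, g in zip(logits, grad)]
    return softmax(logits)


final_probs = ppo_bandit(rewards=[0.0, 0.5, 1.0])
print(final_probs)
```

The inner `epochs` loop is what the importance weights buy us: without them, each batch could justify only a single gradient step.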
Add $\sum_i \sum_t \mathcal{H}\left(\pi_{\theta'}(\cdot \mid s_t^{(i)})\right)$ to the objective for entropy regularization, which encourages exploration (because the agent now tries to maximize entropy as well).
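For a discrete action space, the entropy bonus could be computed as in the sketch below. The weighting coefficient `beta` is an assumption added here (in practice the bonus is usually scaled); the text above adds the entropy sum directly, which corresponds to `beta=1.0`. The function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy H(pi(.|s)) = -sum_a pi(a|s) log pi(a|s), in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def objective_with_entropy(surrogate, action_dists, beta=0.01):
    """Add beta times the summed per-state policy entropies to a surrogate.

    `action_dists` holds one action distribution pi(.|s) per sampled state.
    """
    return surrogate + beta * sum(entropy(d) for d in action_dists)


# A uniform policy has maximal entropy; a deterministic one has zero.
print(entropy([0.25, 0.25, 0.25, 0.25]), entropy([1.0, 0.0]))
```

Because near-deterministic policies earn almost no bonus, maximizing this objective discourages the policy from collapsing onto a single action too early.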