where J(θ) is the value being maximized in the RL objective,
$$\theta^* = \arg\max_\theta \underbrace{\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]}_{J(\theta)}$$
and improve it by computing a gradient on the parameters
$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$
One might think that we're done with just this: we can go train a neural network now and compute the gradient $\nabla_\theta J(\theta)$ to learn the correct policy. However, whatever machine learning library we're using doesn't know about the external environment that produces the reward in response to our policy's decisions. Differentiating a sample estimate of $J(\theta)$ naively would therefore just give a gradient of 0, since the program has no record of how the sampled states, actions, and rewards depend on the policy $\pi_\theta$. This is evidenced most clearly by the fact that our estimate of $J(\theta)$, i.e. $\frac{1}{N}\sum_i \sum_t r(s_t^{(i)}, a_t^{(i)})$, doesn't even include $\theta$ in the expression!
Direct Policy Differentiation
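Direct policy differentiation gets around this by rewriting $\nabla_\theta J(\theta)$ in a form that can be estimated from sampled trajectories, so that we can optimize with respect to $\theta$ without differentiating through the environment.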
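Let $r(\tau) = \sum_{t=1}^{H} r(s_t, a_t)$. Then we can write
$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[r(\tau)] = \int p_\theta(\tau)\, r(\tau)\, d\tau$$
Differentiating with respect to $\theta$ produces
$$\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau$$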
where the gradient operator may be moved inside the integral because differentiation is linear and, under mild regularity conditions, commutes with integration. Now, note the following identity.
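$$p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau) = p_\theta(\tau)\, \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} = \nabla_\theta p_\theta(\tau)$$
which allows us to substitute
$$\nabla_\theta J(\theta) = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, r(\tau)\, d\tau$$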
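This is important because the integral is now an expectation under $p_\theta(\tau)$, which we can estimate by sampling trajectories:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\right]$$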
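For reference, here is the standard next step written out. It assumes the usual trajectory factorization $p_\theta(\tau) = p(s_1)\prod_{t=1}^{H} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$, which this section does not restate. The initial-state and dynamics terms do not depend on $\theta$, so they vanish under the gradient:

$$\log p_\theta(\tau) = \log p(s_1) + \sum_{t=1}^{H} \left[\log \pi_\theta(a_t \mid s_t) + \log p(s_{t+1} \mid s_t, a_t)\right]$$

$$\nabla_\theta \log p_\theta(\tau) = \sum_{t=1}^{H} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \left(\sum_{t=1}^{H} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\right) \left(\sum_{t=1}^{H} r(s_t^{(i)}, a_t^{(i)})\right)$$

The last line is a Monte Carlo estimate over $N$ sampled trajectories, and it only requires gradients of the policy itself.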
So, the policy gradient is identical to the maximum likelihood gradient except for the reward term; thus, the policy gradient doesn't just try to imitate the sampled trajectories: it makes trajectories with positive reward more likely and trajectories with negative reward less likely. Essentially, you can think of policy gradient as formalizing trial and error!
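To make the estimator concrete, here is a minimal sketch on a toy problem that is not from these notes: a one-step, three-action "bandit" with a softmax policy, where the exact gradient is available in closed form for comparison. The helper names, logits, and rewards are all illustrative assumptions.

```python
import math
import random

def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(pi, a):
    """Gradient of log pi_theta(a) w.r.t. the logits of a softmax policy."""
    return [(1.0 if k == a else 0.0) - pi[k] for k in range(len(pi))]

def reinforce_gradient(theta, rewards, n_samples, rng):
    """Monte Carlo estimate (1/N) * sum_i grad log pi(a_i) * r(a_i)."""
    pi = softmax(theta)
    grad = [0.0] * len(theta)
    for _ in range(n_samples):
        a = rng.choices(range(len(theta)), weights=pi)[0]  # sample an action
        g = grad_log_pi(pi, a)
        for k in range(len(theta)):
            grad[k] += g[k] * rewards[a] / n_samples
    return grad

def exact_gradient(theta, rewards):
    """Closed form for this toy problem: pi_k * (r_k - E_pi[r])."""
    pi = softmax(theta)
    mean_r = sum(p * r for p, r in zip(pi, rewards))
    return [p * (r - mean_r) for p, r in zip(pi, rewards)]

rng = random.Random(0)
theta = [0.0, 0.5, -0.5]      # hypothetical policy logits
rewards = [1.0, 0.0, -1.0]    # hypothetical reward per action
estimate = reinforce_gradient(theta, rewards, 50_000, rng)
exact = exact_gradient(theta, rewards)
print("estimate:", estimate)
print("exact:   ", exact)
```

Note that the sampling step is never differentiated; only $\log \pi_\theta$ is, which is what lets the estimator work without a model of the environment.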
Partial Observability?
Does policy gradient still work for RL problems with partial observability, i.e. $o_t \neq s_t$? The answer is yes. It works pretty much identically, just with $\log \pi_\theta(a_t^{(i)} \mid o_t^{(i)})$ instead, because the Markov property was never used in the derivation.
Limitations
Policy gradient methods, however, have some limitations. For example, consider applying REINFORCE to an RL problem where a model plays from a preset chess position. The reward is +1 for winning and −1 for losing.
Naturally, the goal is for the model to favor good moves and penalize bad moves; for this to happen, we'd like $\nabla_\theta J(\theta)$ to produce positive multipliers for good moves and negative multipliers for bad moves. But this is not necessarily the case, since other factors influence the model's winning chances:
- The quality of the starting position
- Incorrect multipliers when a good move is made, only to be followed later by a bad move (or vice versa)
Now, while these issues do average out, it takes many, many samples! The key problem is that this policy gradient algorithm has high variance. How can we reduce it?
Variance Reduction
Baselines
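We can essentially demean the trajectories' rewards by computing
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \log p_\theta(\tau^{(i)}) \left[r(\tau^{(i)}) - b\right], \qquad b = \frac{1}{N}\sum_{i=1}^{N} r(\tau^{(i)})$$
Subtracting a constant $b$ does not change the gradient in expectation, because the extra term has zero mean:
$$\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, b\right] = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, b\, d\tau \overset{(1)}{=} \int \nabla_\theta p_\theta(\tau)\, b\, d\tau \overset{(2)}{=} b\, \nabla_\theta \int p_\theta(\tau)\, d\tau \overset{(3)}{=} b\, \nabla_\theta 1 = 0$$
where (1) derives from the identity we showed earlier, (2) derives from linearity (the constant $b$ and the gradient can be pulled outside the integral), and (3) derives from the properties of a probability distribution, i.e. $\int p(x)\, dx = 1$ for any probability distribution $p$ over $x$.
So the baseline leaves the expected gradient unchanged, but for good choices of $b$ it reduces the variance! Letting $b$ be the average reward (as defined above) does reduce variance and generally performs well. A better, variance-optimal baseline exists, but it is not used in practice due to its complexity.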
The intuition here is that, even if all the trajectories' rewards are positive, we still induce the model to more strongly favor the higher rewards, rather than favor all positive rewards.
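The effect of a baseline can be checked numerically. The sketch below is an illustrative assumption, not part of the original notes: a one-step, three-action problem with a uniform softmax policy and rewards that are all positive. It compares per-sample gradient estimates with $b = 0$ against those with $b$ set to the average sampled reward.

```python
import math
import random

def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def per_sample_gradients(theta, rewards, baseline, actions):
    """One estimate grad log pi(a) * (r(a) - b) per sampled action."""
    pi = softmax(theta)
    out = []
    for a in actions:
        weight = rewards[a] - baseline
        out.append([((1.0 if k == a else 0.0) - pi[k]) * weight
                    for k in range(len(theta))])
    return out

def mean_and_total_variance(samples):
    """Component-wise mean, and variance summed over components."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[k] for s in samples) / n for k in range(d)]
    var = sum(sum((s[k] - mean[k]) ** 2 for s in samples) / n
              for k in range(d))
    return mean, var

rng = random.Random(0)
theta = [0.0, 0.0, 0.0]            # uniform policy (hypothetical)
rewards = [10.0, 11.0, 12.0]       # all positive (hypothetical)
actions = rng.choices(range(3), weights=softmax(theta), k=20_000)

b = sum(rewards[a] for a in actions) / len(actions)  # average reward
mean_plain, var_plain = mean_and_total_variance(
    per_sample_gradients(theta, rewards, 0.0, actions))
mean_base, var_base = mean_and_total_variance(
    per_sample_gradients(theta, rewards, b, actions))

print("no baseline:  mean", mean_plain, "total variance", var_plain)
print("avg baseline: mean", mean_base, "total variance", var_base)
```

Both means should land near the exact gradient for this toy problem, which is $(-\tfrac{1}{3}, 0, \tfrac{1}{3})$, while the total variance with the baseline should be far smaller.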
Causality
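Causality means that the policy at time $t'$ cannot affect the reward at time $t$ when $t < t'$. At the moment, our gradient calculation does not take this into account: every $\nabla_\theta \log \pi_\theta$ term is multiplied by the reward of the whole trajectory. We can instead drop the rewards that were received before each action:
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{H} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)}) \left(\sum_{t'=t}^{H} r(s_{t'}^{(i)}, a_{t'}^{(i)})\right)$$
so that the gradient term at time step $t$ is only influenced by rewards from time steps $t' \in [t, H]$.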
Note that this reduces variance for a rather simple reason: we're multiplying $\nabla_\theta \log \pi_\theta$ by sums over fewer reward terms, which tend to be smaller in magnitude. Regardless, this modification is typically effective.
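The inner sum over future rewards (often called the "reward-to-go") can be computed for every time step in a single backward pass. A minimal sketch, with a hypothetical helper name and made-up rewards:

```python
def reward_to_go(rewards):
    """Return a list whose entry t equals sum(rewards[t:])."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the sum already accumulated.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        out[t] = running
    return out

rewards = [1.0, 0.0, 2.0, -1.0]   # hypothetical per-step rewards
print(reward_to_go(rewards))      # [2.0, 1.0, 1.0, -1.0]
```

The first entry is the total trajectory reward, so the original estimator corresponds to using that one value as the weight at every time step.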
Practical Implementation
Automatic Differentiation
We can compute the policy gradient terms $\nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})$ with automatic differentiation. The key idea is that these gradients are the same as the gradients for a supervised neural network; we just have to design the right graph such that its overall gradient is the policy gradient $\nabla_\theta J(\theta)$.
We have to essentially set the loss function to be a weighted maximum likelihood loss function (recall the [[#Maximum Likelihood Estimation|comparison]] between maximum likelihood and policy gradient), weighted by reward.
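$$\tilde{J}(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{H} \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\, \hat{Q}_t^{(i)}$$
where $\hat{Q}_t^{(i)} = \sum_{t'=t}^{H} r(s_{t'}^{(i)}, a_{t'}^{(i)})$ is treated as a constant weight, i.e. it is not differentiated. $-\log \pi_\theta$ is the cross-entropy loss for discrete problems and, up to scale and an additive constant for a fixed variance, the squared error (MSE) for continuous (Gaussian) problems, like with maximum likelihood.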
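As a rough check that differentiating this weighted log-likelihood really yields the policy gradient, the sketch below uses finite differences as a stand-in for an autodiff library. Everything in it is an illustrative assumption: the policy is a state-independent softmax over three actions, and the actions and weights are made up. Since most libraries minimize, one would pass the negative of this surrogate as the loss.

```python
import math

def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def surrogate(theta, actions, q_hat):
    """Weighted log-likelihood: sum_t log pi_theta(a_t) * Q_hat_t."""
    log_pi = [math.log(p) for p in softmax(theta)]
    return sum(log_pi[a] * q for a, q in zip(actions, q_hat))

def policy_gradient(theta, actions, q_hat):
    """Analytic sum_t grad log pi_theta(a_t) * Q_hat_t for a softmax."""
    pi = softmax(theta)
    grad = [0.0] * len(theta)
    for a, q in zip(actions, q_hat):
        for k in range(len(theta)):
            grad[k] += ((1.0 if k == a else 0.0) - pi[k]) * q
    return grad

def finite_difference(f, theta, eps=1e-6):
    """Central-difference gradient; stands in for autodiff here."""
    grad = []
    for k in range(len(theta)):
        up, down = list(theta), list(theta)
        up[k] += eps
        down[k] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

theta = [0.2, -0.1, 0.4]          # hypothetical logits
actions = [0, 2, 1, 2]            # hypothetical sampled actions
q_hat = [2.0, 1.0, 1.0, -1.0]     # fixed weights, not differentiated

numeric = finite_difference(lambda th: surrogate(th, actions, q_hat), theta)
analytic = policy_gradient(theta, actions, q_hat)
print(numeric)
print(analytic)
```

The two gradients agree only because `q_hat` is held fixed while `theta` is perturbed, which mirrors treating the weights as constants in the graph.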
The pseudocode for maximum likelihood (discrete) for supervised learning would look like