
Lecture 14: RL with Sequences and LLMs

Reward Learning and Inverse RL

Recall the idea of inverse reinforcement learning (IRL): learning a model of the reward function from samples demonstrated by an "expert" actor. This is in contrast to behavioral cloning in imitation learning, which simply seeks to mimic the behavior of the expert, and therefore exhibits limited generalization to unseen states that require similar reasoning. (There are also other advantages of IRL, such as avoiding the compounding error from distributional shift that behavioral cloning typically encounters.)

To apply inverse RL within our previous soft optimality framework, we learn the reward model from sample trajectories drawn from a soft-optimal policy. We use the equation

$$p(\mathcal{O}_{t}\mid \mathbf{s}_{t},\mathbf{a}_{t},\psi)=\exp(r_{\psi}(\mathbf{s}_{t},\mathbf{a}_{t}))$$

and use maximum likelihood estimation to find that

$$\hat{\psi}=\underset{\psi}{\arg\max}\ \frac{1}{N}\sum_{i=1}^{N} \log p(\tau_{i}\mid \mathcal{O}_{1:T},\psi)=\underset{\psi}{\arg\max}\ \frac{1}{N}\sum_{i=1}^{N} r_{\psi}(\tau_{i})-\log Z$$

where $Z$ is the inverse RL partition function, and is equivalent to

$$Z=\int p(\tau)\exp r_{\psi}(\tau) \,\mathrm{d}\tau$$

Note that $Z$ functions as a "normalizer" for $\psi$ in order to produce a probability distribution, considering that the rewards themselves are unnormalized. Taking the gradient of our loss function, we find that

$$
\begin{align*}
\nabla_{\psi}\mathcal{L} &= \frac{1}{N}\sum_{i=1}^{N} \nabla_{\psi}r_{\psi}(\tau_{i})-\int \underbrace{ \frac{1}{Z} p(\tau)\exp(r_{\psi}(\tau)) }_{ p(\tau \mid \mathcal{O}_{1:T},\psi) }\nabla_{\psi}r_{\psi}(\tau) \,\mathrm{d}\tau \\
&= \mathbb{E}_{\tau \sim \pi^{*}(\tau)}[\nabla_{\psi}r_{\psi}(\tau)]-\mathbb{E}_{\tau \sim p(\tau \mid \mathcal{O}_{1:T},\psi)}[\nabla_{\psi}r_{\psi}(\tau)]
\end{align*}
$$

In words, these expectations are over trajectories produced by the...

$$\nabla_{\psi}\mathcal{L} = \underbrace{ \mathbb{E}_{\tau \sim \pi^{*}(\tau)}[\nabla_{\psi}r_{\psi}(\tau)] }_{ \text{expert policy} }-\underbrace{ \mathbb{E}_{\tau \sim p(\tau \mid \mathcal{O}_{1:T},\psi)}[\nabla_{\psi}r_{\psi}(\tau)] }_{ \text{soft optimal policy w/ current reward} }$$
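To make the "expert minus soft-optimal" structure concrete, here is a minimal sketch on a toy problem where the trajectory space is small enough to enumerate, so $Z$ can be computed exactly. It assumes a linear reward $r_{\psi}(\tau)=\psi^{\top}f(\tau)$; the feature matrix `feats`, the dynamics-only probabilities `prior`, and the function names are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def log_likelihood(psi, feats, prior, expert_idx):
    """Average of r_psi(tau_i) - log Z over the expert trajectories.

    feats[k] is a hypothetical feature vector f(tau_k), so r_psi(tau_k) = feats[k] @ psi.
    prior[k] is p(tau_k), the dynamics-only probability of trajectory k.
    The term log p(tau_i) is dropped because it does not depend on psi.
    """
    rewards = feats @ psi
    log_z = np.log(np.sum(prior * np.exp(rewards)))  # Z = sum_k p(tau_k) exp(r_psi(tau_k))
    return np.mean(rewards[expert_idx]) - log_z

def gradient(psi, feats, prior, expert_idx):
    """Expert feature mean minus feature mean under p(tau | O_{1:T}, psi)."""
    rewards = feats @ psi
    unnormalized = prior * np.exp(rewards)
    soft_optimal = unnormalized / unnormalized.sum()  # p(tau | O_{1:T}, psi)
    # For a linear reward, grad_psi r_psi(tau_k) is just feats[k].
    return feats[expert_idx].mean(axis=0) - soft_optimal @ feats
```

Comparing `gradient` against a finite-difference estimate of `log_likelihood` is a quick way to check that the two expectations really are the gradient of the objective.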

For estimating that second expectation, i.e. the expectation over trajectories from our soft optimal policy, we can apply any maximum-entropy RL algorithm (see Lecture 13). Thus, we can just do

$$\nabla_{\psi}\mathcal{L}\approx\underbrace{ \frac{1}{N}\sum_{i=1}^{N} \nabla_{\psi}r_{\psi}(\tau_{i}) }_{ \text{expert samples} } -\underbrace{ \frac{1}{M}\sum_{j=1}^{M} \nabla_{\psi}r_{\psi}(\tau_{j}) }_{ \text{policy samples} }$$

Unfortunately, this method is computationally expensive. (Why? We effectively have two nested loops of gradient ascent; for every step we take on $\psi$, we have to run an entire maximum-entropy RL algorithm to relearn our soft optimal policy.) So, can we do better?

Taking ideas from generalized policy iteration, we can be "lazy" with our relearning of our soft optimal policy; instead of learning an estimate of $p(\mathbf{a}_{t} \mid \mathbf{s}_{t},\mathcal{O}_{1:T},\psi)$ until convergence every time we update $r_{\psi}$, we only make a marginal update (take a few policy gradient/max-entropy algorithm steps on our policy $\pi$) that brings us closer to the soft optimal policy for our new reward function. Sadly, this is not valid and produces a biased estimate: our second expectation is now being performed over samples taken from a non-optimal policy for our reward function...

...wait. Training on samples taken from a trajectory distribution produced by another policy? Isn't that just off-policy learning? So we can just use importance sampling to correct our algorithm!

$$
\begin{align*}
\nabla_{\psi}\mathcal{L} &\approx \frac{1}{N}\sum_{i=1}^{N} \nabla_{\psi}r_{\psi}(\tau_{i}) - \frac{1}{\sum_{j}w_{j}}\sum_{j=1}^{M} w_{j}\nabla_{\psi}r_{\psi}(\tau_{j})
\end{align*}
$$

where we define

$$
\begin{align*}
w_{j} &= \frac{p(\tau_{j})\exp(r_{\psi}(\tau_{j}))}{\pi(\tau_{j})}=\frac{p(\mathbf{s}_{1})\prod_{t}p(\mathbf{s}_{t+1}\mid \mathbf{s}_{t},\mathbf{a}_{t})\exp(r_{\psi}(\mathbf{s}_{t},\mathbf{a}_{t}))}{p(\mathbf{s}_{1})\prod_{t}p(\mathbf{s}_{t+1}\mid \mathbf{s}_{t},\mathbf{a}_{t})\pi(\mathbf{a}_{t}\mid \mathbf{s}_{t})} \\
&= \frac{\exp\left( \sum_{t}r_{\psi}(\mathbf{s}_{t},\mathbf{a}_{t}) \right)}{\prod_{t}\pi(\mathbf{a}_{t}\mid \mathbf{s}_{t})}
\end{align*}
$$

Intuitively, this weights each sample by its exponentiated reward under the current reward function, relative to the probability of that sample occurring under the non-optimal policy. In other words, we reweight the samples to more heavily favor those that the optimal policy for the current reward function would have been more likely to produce.
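The importance-weighted estimator above can be sketched in a few lines. This is only an illustration under assumptions: each trajectory is summarized by its total reward $\sum_{t}r_{\psi}$, its total policy log-probability $\sum_{t}\log\pi(\mathbf{a}_{t}\mid\mathbf{s}_{t})$, and a precomputed reward gradient; the array and function names are hypothetical.

```python
import numpy as np

def self_normalized_weights(traj_rewards, traj_logp_policy):
    """Return w_j / sum_j w_j with w_j = exp(sum_t r_psi) / prod_t pi(a_t | s_t)."""
    log_w = traj_rewards - traj_logp_policy
    # Shifting by the max avoids overflow; the shift cancels after normalizing.
    log_w = log_w - log_w.max()
    w = np.exp(log_w)
    return w / w.sum()

def irl_gradient(expert_grads, policy_grads, traj_rewards, traj_logp_policy):
    """Expert-sample mean minus importance-weighted mean over policy samples.

    expert_grads[i] and policy_grads[j] hold grad_psi r_psi(tau) for each trajectory.
    """
    w = self_normalized_weights(traj_rewards, traj_logp_policy)
    return expert_grads.mean(axis=0) - w @ policy_grads
```

Working in log space matters here because both $\exp(\sum_{t}r_{\psi})$ and $\prod_{t}\pi$ can easily overflow or underflow for long trajectories.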

One may interpret this training procedure as a two-player adversarial game. There is a discriminator, $\psi$, that attempts to increase the likelihood of samples from the expert demonstrations (the positive first expectation $\mathbb{E}_{\tau \sim \pi^{*}(\tau)}$ encourages gradient ascent on/maximization of rewards from the expert trajectories), and simultaneously to decrease the likelihood of samples from the policy $\theta$ (the negative second expectation $-\mathbb{E}_{\tau \sim p(\mathbf{a}_{t}\mid \mathbf{s}_{t},\mathcal{O}_{1:T},\psi)}$ encourages gradient descent on/minimization of rewards from policy trajectories). Meanwhile, the generator $\theta$ is continuously improved to make it harder for $\psi$ to distinguish between samples from the expert and samples from $\theta$. In this way, we continually improve our reward function and policy in a competing manner, until convergence at an equilibrium where $\pi^{*}=p(\mathbf{a}_{t}\mid \mathbf{s}_{t},\mathcal{O}_{1:T},\psi)$, i.e. the policy is encouraged to imitate the expert.

Such game-like models are known as generative adversarial networks (GANs). A discriminator model is trained to distinguish between samples from a dataset and samples from a generator model, while the generator model is optimized to fool the discriminator.

So, let's go back and approach this question from the beginning, but with the intent of designing a GAN for our problem. We still have a policy as our generator model, but instead of a reward model, we'll use a binary discriminator.

$$\psi\leftarrow \underset{\psi}{\arg\max}\ \mathbb{E}_{\tau \sim p^{*}}[\log D_{\psi}(\tau)]+\mathbb{E}_{\tau \sim \pi_{\theta}}[\log(1-D_{\psi}(\tau))]$$

and our policy gradient is

$$\nabla_{\theta}\mathcal{L}\approx \frac{1}{M}\sum_{j=1}^{M} \nabla_{\theta}\log \pi_{\theta}(\tau_{j})\log D_{\psi}(\tau_{j})$$

This is much simpler to implement than our inverse RL algorithm. However, at the end of training, since the policy's samples are essentially indistinguishable from the expert demonstrations, the discriminator will learn to simply output probability $0.5$ on every input; that is, the discriminator knows nothing at convergence! Moreover, it's generally not possible to re-optimize a policy against $D$ as though it were a reward function.
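As a small numerical illustration of the discriminator objective and of the "knows nothing at convergence" point, the sketch below evaluates the objective from discriminator outputs alone. The function names and the example probabilities are assumptions made for illustration; no actual discriminator network is trained here.

```python
import numpy as np

def discriminator_objective(d_expert, d_policy):
    """E_expert[log D(tau)] + E_policy[log(1 - D(tau))], maximized over psi.

    d_expert and d_policy are arrays of discriminator outputs in (0, 1)
    on expert trajectories and policy trajectories respectively.
    """
    return np.mean(np.log(d_expert)) + np.mean(np.log1p(-d_policy))

def policy_gradient_weights(d_policy):
    """Per-trajectory scalar log D(tau_j) that multiplies grad_theta log pi_theta(tau_j)."""
    return np.log(d_policy)

# Early in training the discriminator separates the two sample sets well.
early = discriminator_objective(np.array([0.9, 0.8]), np.array([0.1, 0.2]))
# At convergence it outputs 0.5 everywhere and the objective equals -log 4.
converged = discriminator_objective(np.full(2, 0.5), np.full(2, 0.5))
```

Once every output is $0.5$, `policy_gradient_weights` is the same constant for all trajectories, so it no longer carries any signal about which trajectories are better.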

In short, though, we have the following comparison.

[Figure: irl-gan.png, comparison of the inverse RL and GAN formulations]

RL and LLMs: The Basics

Important questions for applying RL to LLMs:

First, we discuss a short anatomy of LLM training.

  1. Pre-training: Supervised training on a large corpus of data, gives LLM knowledge.

  2. Post-training: Tells LLM how to use its knowledge.

    • Supervised fine-tuning (SFT) (instruction tuning, basically imitation learning) on a smaller, high-quality dataset that demonstrates what you want the LLM to do (e.g. be a helpful model).
    • RL fine-tuning (e.g. reinforcement learning from human feedback/RLHF).

Instruction tuning was the initial LLM post-training methodology. However, it's labor-intensive to collect an SFT dataset, especially for skilled tasks like programming. Reinforcement learning offers a more tractable solution.

There are two main approaches to RL post-training.

So how do we define our MDP for an LLM? There are a couple methods.

The second perspective is more sensible, as it allows for intermediate rewards (rewards during response generation) and value function baselines (which tokens are better than others?).

We'll now try and apply policy gradient to LLM post-training. We have, as usual,

$$
\begin{align*}
\nabla_{\theta}\mathbb{E}_{\pi_{\theta}(\mathbf{a}\mid \mathbf{s})}[r(\mathbf{s},\mathbf{a})] &= \mathbb{E}_{\pi_{\theta}(\mathbf{a}\mid \mathbf{s})}[\nabla_{\theta}\log \pi_{\theta}(\mathbf{a}\mid \mathbf{s})r(\mathbf{s},\mathbf{a})] \\
&\approx \frac{1}{N}\sum_{i}\nabla_{\theta}\log \pi_{\theta}(\mathbf{a}_{i}\mid \mathbf{s}_{i})r(\mathbf{s}_{i},\mathbf{a}_{i}) \\
&\approx \frac{1}{N}\sum_{i} \frac{\pi_{\theta}(\mathbf{a}_{i}\mid \mathbf{s}_{i})}{\bar{\pi}(\mathbf{a}_{i}\mid \mathbf{s}_{i})}\nabla_{\theta}\log \pi_{\theta}(\mathbf{a}_{i}\mid \mathbf{s}_{i})r(\mathbf{s}_{i},\mathbf{a}_{i})
\end{align*}
$$

where the first approximation is a REINFORCE-style estimator and the second approximation is an importance-weighted estimator like PPO. In practice, the second approximation is used for efficiency reasons (multiple gradient steps for each data sampling step).
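The two estimators can be compared on a toy problem. The sketch below assumes a single-state (bandit) softmax policy whose parameters are the logits themselves, so that $\nabla_{\theta}\log\pi_{\theta}(a)$ has a closed form; it omits PPO's clipping and is only meant to show where the importance ratio enters. All names are hypothetical.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_prob(logits, actions):
    """Row i is grad_theta log pi_theta(a_i) = onehot(a_i) - pi_theta."""
    probs = softmax(logits)
    return np.eye(len(logits))[actions] - probs

def reinforce_estimate(logits, actions, rewards):
    """(1/N) sum_i grad log pi_theta(a_i) r_i, with a_i sampled from pi_theta."""
    return (grad_log_prob(logits, actions) * rewards[:, None]).mean(axis=0)

def importance_weighted_estimate(logits, behavior_logits, actions, rewards):
    """Same estimator, but a_i was sampled from the behavior policy pi_bar."""
    ratio = softmax(logits)[actions] / softmax(behavior_logits)[actions]
    return (grad_log_prob(logits, actions) * (ratio * rewards)[:, None]).mean(axis=0)
```

When `behavior_logits` equals `logits` the ratio is 1 and the two estimators coincide; as the policy takes further gradient steps on the same batch, the ratio corrects for the growing mismatch.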

Algorithms for RL with LLMs

There are a couple issues to address. We have to choose a good...

Baseline

Recall value function baselines and GAE. LLMs typically use a GAE baseline for the objective function. For PPO, we use a loss function for our value function estimator $\hat{V}_{\phi}^{\pi}$, defined as

$$
\begin{align*}
\mathcal{L}(\phi) &= \frac{1}{2}\sum_{i=1}^{N} \lVert \hat{V}_{\phi}^{\pi}(\mathbf{s}_{i})-y_{i} \rVert ^{2} \\
&= \frac{1}{2}\sum_{i=1}^{N} \lVert \hat{V}_{\phi}^{\pi}(\mathbf{s}_{i})-(\hat{A}_{\text{GAE}}^{\pi}(\mathbf{s}_{i},\mathbf{a}_{i})+\hat{V}_{\text{old}}(\mathbf{s}_{i})) \rVert ^{2}
\end{align*}
$$
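A minimal sketch of how the targets $y_{i}=\hat{A}_{\text{GAE}}^{\pi}+\hat{V}_{\text{old}}$ might be computed for a single response is shown below. It assumes per-token rewards and old value estimates are already available as arrays; the default `gamma` and `lam` values and the function names are illustrative choices, not values from the lecture.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimates for one trajectory.

    rewards has length T. values has length T + 1, where values[T] is the
    bootstrap value after the last step (0.0 if the episode terminated).
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def value_loss(v_pred, adv, v_old):
    """0.5 * sum_i (V_phi(s_i) - (A_GAE_i + V_old(s_i)))^2."""
    targets = adv + v_old
    return 0.5 * np.sum((v_pred - targets) ** 2)
```

With `gamma=1` and `lam=1` the targets reduce to the Monte Carlo return from each token onward, which is a convenient sanity check.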

Now, we do need to use our language model to produce estimates of our value function. We can either (1) copy the LLM and train the copy to estimate the value function or (2) add a second head to each output token that estimates the value of that token.

Also, the value function is typically not trained on the prompt, only on the response (typically the suffix of the sequence, e.g. the last word of "What is the capital of France? Paris").

There's another way to compute baselines with LLMs that does not require an extra copy or head, which is in fact not applicable to general RL. We can estimate our baseline by using decision-time planning: we can "checkpoint" the current state of the LLM, and average across several trajectory samples branching out from the current state. This is possible because we can easily simulate multiple completions from the same LLM state, something that may not be possible for an autonomous driving RL problem. This is known as group relative policy optimization (GRPO), and is much simpler to implement :)
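The sampled-group baseline described above can be sketched as follows. This assumes one scalar reward per completion of the same prompt; the optional division by the group standard deviation is a common variant included here as an assumption, and the function name is hypothetical.

```python
import numpy as np

def group_relative_advantages(rewards, normalize=False, eps=1e-8):
    """Advantage of each completion relative to its group's average reward.

    rewards holds one scalar reward per completion sampled from the same prompt.
    The group mean plays the role of the baseline, so no value network is needed.
    """
    rewards = np.asarray(rewards, dtype=float)
    adv = rewards - rewards.mean()
    if normalize:
        # Optional rescaling by the group's standard deviation (an assumption here).
        adv = adv / (rewards.std() + eps)
    return adv
```

For example, with rewards `[1, 0, 0, 1]` the two correct completions receive a positive advantage and the two incorrect ones a negative advantage of the same magnitude.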

Regularization

Typically, this is done by simply adding a KL divergence term to the reward function with respect to some reference model, commonly the model produced by SFT.

$$\bar{r}(\mathbf{s}_{t},\mathbf{a}_{t})=r(\mathbf{s}_{t},\mathbf{a}_{t})-\beta D_{\text{KL}}(\pi_{\theta}(\mathbf{a}_{t}\mid \mathbf{s}_{t}),\pi_{\text{ref}}(\mathbf{a}_{t}\mid \mathbf{s}_{t}))$$
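As a rough sketch of the penalized reward at a single token position, the code below computes the exact KL divergence between two next-token distributions over a small vocabulary. The function names and the choice to compute the KL over the full distribution (rather than estimating it from the sampled token) are assumptions for illustration.

```python
import numpy as np

def kl_categorical(p, q):
    """D_KL(p, q) = sum_a p(a) log(p(a) / q(a)) for categorical distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(a) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def penalized_reward(r, pi_theta, pi_ref, beta):
    """r_bar = r - beta * D_KL(pi_theta(. | s_t), pi_ref(. | s_t))."""
    return r - beta * kl_categorical(pi_theta, pi_ref)
```

A larger `beta` keeps the policy closer to the reference model; with `beta = 0` the original reward is recovered.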

Reward Function: Bradley-Terry Model (Preference)

Optometrist Algorithm

Colloquially known as the "optometrist algorithm," since an optometrist asks a patient to compare two sets of lenses until they narrow it down to a good lens.

The algorithm proceeds as follows.

  1. Sample 2 or more trajectories $\tau_{i}\sim \pi_{\theta}(\tau)$.
  2. Ask a person to choose preference(s) in $\{ \tau_{i} \}_{i=1}^{N}$ (i.e. pairwise ranking).
  3. Train reward $r_{\psi}$ using preferences.
  4. Train policy $\pi_{\theta}$ using $r_{\psi}$.

But, how do we train $r_{\psi}$? We should use a probabilistic model to account for the stochasticity and noise of human decisions. In particular, we apply

$$p(\tau_{i}\succ \tau_{j})= \frac{\exp\left( \sum_{t}r_{\psi}(\mathbf{s}_{t}^{(i)},\mathbf{a}_{t}^{(i)}) \right)}{\exp\left( \sum_{t}r_{\psi}(\mathbf{s}_{t}^{(i)},\mathbf{a}_{t}^{(i)}) \right)+\exp\left( \sum_{t}r_{\psi}(\mathbf{s}_{t}^{(j)},\mathbf{a}_{t}^{(j)}) \right)}$$

which is actually just

$$p(\tau_{i}\succ \tau_{j})=\sigma(r_{\psi}(\tau_{i})-r_{\psi}(\tau_{j}))$$

Chess Elo

This is the same logistic form that Elo ratings in chess use to predict the outcome of a game from the difference between two players' ratings!
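For completeness, the reduction of the Bradley-Terry probability to a sigmoid follows from dividing the numerator and denominator by $\exp(r_{\psi}(\tau_{i}))$, writing $r_{\psi}(\tau)=\sum_{t}r_{\psi}(\mathbf{s}_{t},\mathbf{a}_{t})$ as shorthand:

$$
\begin{align*}
p(\tau_{i}\succ \tau_{j}) &= \frac{\exp(r_{\psi}(\tau_{i}))}{\exp(r_{\psi}(\tau_{i}))+\exp(r_{\psi}(\tau_{j}))} \\
&= \frac{1}{1+\exp\left( -(r_{\psi}(\tau_{i})-r_{\psi}(\tau_{j})) \right)} \\
&= \sigma(r_{\psi}(\tau_{i})-r_{\psi}(\tau_{j}))
\end{align*}
$$

Only the difference in rewards matters, so adding a constant to $r_{\psi}$ leaves every preference probability unchanged.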

So, to train $r_{\psi}$, we simply do maximum likelihood estimation

$$\psi\leftarrow \underset{\psi}{\arg\max}\ \sum_{i,j}\log\sigma(r_{\psi}(\tau_{i})-r_{\psi}(\tau_{j}))$$
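Here is a minimal sketch of this maximum likelihood objective and its gradient, assuming a linear reward $r_{\psi}(\tau)=\psi^{\top}f(\tau)$ over hypothetical trajectory features. The arrays `f_pref` and `f_rej` hold the features of the preferred and rejected trajectory in each labeled pair; all names are illustrative.

```python
import numpy as np

def log_sigmoid(x):
    """log sigma(x) = -log(1 + exp(-x)), computed without overflow."""
    return -np.logaddexp(0.0, -x)

def preference_log_likelihood(psi, f_pref, f_rej):
    """sum over pairs of log sigma(r_psi(tau_i) - r_psi(tau_j))."""
    diff = (f_pref - f_rej) @ psi
    return np.sum(log_sigmoid(diff))

def preference_gradient(psi, f_pref, f_rej):
    """Gradient of the log-likelihood with respect to psi.

    d/dx log sigma(x) = 1 - sigma(x), so each pair contributes
    (1 - sigma(diff)) * (f_pref - f_rej).
    """
    diff = (f_pref - f_rej) @ psi
    coeff = 1.0 - np.exp(log_sigmoid(diff))
    return coeff @ (f_pref - f_rej)
```

Pairs the reward model already ranks correctly with a large margin have `coeff` near zero, so most of the gradient comes from pairs that are mis-ranked or nearly tied.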

Full RLHF Algorithm

  1. Sample 2 or more trajectories $\tau_{i}\sim \pi_{\theta}(\tau)$.
  2. Ask for human preferences $\{ \tau_{i}\succ \tau_{j} \}_{i,j}$.
  3. Train reward model $\psi\leftarrow {\arg\max}_{\psi}\ \sum_{i,j}\log\sigma(r_{\psi}(\tau_{i})-r_{\psi}(\tau_{j}))$.
  4. Train policy $\pi_{\theta}$ with reward $r_{\psi}$.

Other Reward Sources

RL with Partial Observability

The key difference between full observability MDPs and POMDPs is that the observations in a POMDP do not obey the Markov property! In particular, knowledge of $\mathbf{s}_{t}$ makes knowledge of $\mathbf{s}_{t-1}$ useless for predicting the future, but knowledge of $\mathbf{o}_{t}$ does not make knowledge of $\mathbf{o}_{t-1}$ useless. Therefore, knowing the history of observations can be useful.

POMDPs have some special properties when compared to normal MDPs.

So, which methods can be applied to POMDPs?

One critical idea in RL with partial observability is to use history states, where $\mathbf{s}_{t}=(\mathbf{o}_{1},\dots,\mathbf{o}_{t})$. History states, luckily, do obey the Markov property! This allows us to just use history states with our existing full observability RL methods. We'll discuss this in more detail next lecture :>
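The history-state construction can be sketched as a small helper that accumulates observations into a tuple. The class name and its `reset`/`step` interface are hypothetical, chosen only to illustrate the idea.

```python
class HistoryState:
    """Builds s_t = (o_1, ..., o_t) from a stream of observations."""

    def __init__(self):
        self._history = ()

    def reset(self, first_observation):
        """Start a new episode; the state is the one-element history (o_1,)."""
        self._history = (first_observation,)
        return self._history

    def step(self, observation):
        """Append o_t and return the new history state."""
        self._history = self._history + (observation,)
        return self._history

# Two episodes that end in the same observation still yield different states.
tracker = HistoryState()
tracker.reset("door_left")
state_a = tracker.step("hallway")
tracker.reset("door_right")
state_b = tracker.step("hallway")
```

Here `state_a` and `state_b` differ even though the latest observation is identical, which is exactly the information a policy conditioned only on $\mathbf{o}_{t}$ would lose.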