Lecture 18: Offline RL Algorithms

Policy Constraint Methods

Explicit Policy Constraint Methods

To add policy constraints to, e.g., $Q$-function actor-critic, we can modify the actor objective

$$
\begin{align*}
\theta &\leftarrow \underset{\theta}{\arg\max}\ \sum_{\mathbf{s}_{i}\in \mathcal{D}}\Big[\mathbb{E}_{\mathbf{a}\sim \pi_{\theta}(\mathbf{a}\mid \mathbf{s}_{i})}[Q(\mathbf{s}_{i},\mathbf{a})]+\lambda \log \pi_{\theta}(\mathbf{a}_{i}\mid \mathbf{s}_{i})\Big] \\
\theta &\leftarrow \underset{\theta}{\arg\max}\ \sum_{\mathbf{s}_{i}\in \mathcal{D}}\Big[\mathbb{E}_{\mathbf{a}\sim \pi_{\theta}(\mathbf{a}\mid \mathbf{s}_{i})}[Q(\mathbf{s}_{i},\mathbf{a})+\lambda \log \pi_{\beta}(\mathbf{a}\mid \mathbf{s}_{i})]+\lambda \mathcal{H}(\pi_{\theta}(\mathbf{a}\mid \mathbf{s}_{i}))\Big]
\end{align*}
$$

for the forward and reverse KL divergence, respectively. Generally, this is not the best way to add policy constraints, but it can work well if done right.

We can also instead modify the reward function. Typically, this is done with reverse KL + MaxEnt RL.

$$
\begin{align*}
\hat{r}(\mathbf{s},\mathbf{a})&=r(\mathbf{s},\mathbf{a})-\lambda D_{\text{KL}} \\
&= r(\mathbf{s},\mathbf{a})+\lambda \log \pi_{\beta}(\mathbf{a}\mid \mathbf{s})
\end{align*}
$$

This is particularly interesting because it also accounts for future divergence.
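As a minimal sketch of the modified reward, assuming we already have an estimate of the behavior policy's probability $\pi_{\beta}(\mathbf{a}\mid \mathbf{s})$ for the action taken (the function name `shaped_reward`, the `eps` guard, and the numbers are all hypothetical):

```python
import math

def shaped_reward(reward, behavior_prob, lam, eps=1e-8):
    """Return r_hat = r + lam * log pi_beta(a|s) for one transition."""
    # eps is an illustrative guard against log(0) for actions the
    # behavior policy never takes; it is not part of the formula.
    return reward + lam * math.log(max(behavior_prob, eps))

# Same raw reward, but one action is common in the data and one is rare.
common = shaped_reward(1.0, behavior_prob=0.5, lam=0.1)
rare = shaped_reward(1.0, behavior_prob=0.01, lam=0.1)
print(common, rare)
```

The rare action ends up with the lower shaped reward. Because the penalty lives in the reward, it is also propagated through the value function, which is the "future divergence" point made above.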

Now, we can construct some real offline actor-critic algorithms.

Here's a BRAC-like, behavior regularized actor critic, algorithm. (Behavior Regularized Offline Reinforcement Learning, Wu et al.)

  1. Evaluate $y_{i}=r(\mathbf{s}_{i},\mathbf{a}_{i})+\gamma \mathbb{E}_{\mathbf{a}'\sim \pi_{\theta}(\mathbf{a}'\mid \mathbf{s}_{i}')}[\hat{Q}_{\phi}^{\pi}(\mathbf{s}_{i}',\mathbf{a}')-\lambda D(\pi_{\theta}(\cdot \mid \mathbf{s}_{i}'),\pi_{\beta}(\cdot \mid\mathbf{s}_{i}'))]$. (modifying reward)
  2. Update $\phi$ using $\nabla_{\phi}\sum_{i=1}^{B}\lVert \hat{Q}_{\phi}^{\pi}(\mathbf{s}_{i},\mathbf{a}_{i})-y_{i} \rVert^{2}$.
  3. $J(\theta)=\sum_{i}\mathbb{E}_{\mathbf{a}\sim \pi_{\theta}(\mathbf{a}\mid \mathbf{s}_{i})}[\hat{Q}_{\phi}^{\pi}(\mathbf{s}_{i},\mathbf{a})]-\lambda D(\pi_{\theta}(\cdot \mid \mathbf{s}_{i}),\pi_{\beta}(\cdot \mid \mathbf{s}_{i}))$. (modifying actor objective)
  4. $\theta\leftarrow\theta+\alpha \nabla_{\theta}J(\theta)$.
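To make the actor objective in step 3 concrete, here is a small sketch for a single state with discrete actions, taking $D$ to be the reverse KL divergence $D_{\text{KL}}(\pi_{\theta}\parallel \pi_{\beta})$. The function names and the toy numbers are assumptions for illustration, and the behavior policy is assumed to have full support.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def brac_actor_objective(policy, behavior, q_values, lam):
    """E_{a~pi}[Q(s,a)] - lam * KL(pi || pi_beta) for one state."""
    expected_q = sum(p * q for p, q in zip(policy, q_values))
    return expected_q - lam * kl_divergence(policy, behavior)

behavior = [0.7, 0.2, 0.1]   # hypothetical pi_beta(.|s)
q_values = [1.0, 2.0, 5.0]   # hypothetical Q(s, .)
greedy = [0.0, 0.0, 1.0]     # policy that ignores the constraint

for lam in (1.0, 5.0):
    print(lam,
          brac_actor_objective(greedy, behavior, q_values, lam),
          brac_actor_objective(behavior, behavior, q_values, lam))
```

With a small $\lambda$ the greedy policy scores higher; with a large $\lambda$ staying at the behavior policy scores higher, which is the trade-off the regularizer controls.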

And here's an AC+BC-like, actor critic + behavioral cloning?, algorithm.

  1. Evaluate $y_{i}=r(\mathbf{s}_{i},\mathbf{a}_{i})+\gamma \mathbb{E}_{\mathbf{a}'\sim \pi_{\theta}(\mathbf{a}'\mid \mathbf{s}_{i}')}[\hat{Q}_{\phi}^{\pi}(\mathbf{s}_{i}',\mathbf{a}')]$.
  2. Update $\phi$ using $\nabla_{\phi}\sum_{i=1}^{B}\lVert \hat{Q}_{\phi}^{\pi}(\mathbf{s}_{i},\mathbf{a}_{i})-y_{i} \rVert^{2}$.
  3. $J(\theta)=\sum_{i}\mathbb{E}_{\mathbf{a}\sim \pi_{\theta}(\mathbf{a}\mid \mathbf{s}_{i})}[\hat{Q}_{\phi}^{\pi}(\mathbf{s}_{i},\mathbf{a})]+\lambda \log \pi_{\theta}(\mathbf{a}_{i}\mid \mathbf{s}_{i})$. (modifying actor objective with forward KL)
  4. $\theta\leftarrow\theta+\alpha \nabla_{\theta}J(\theta)$.

Implicit Policy Constraint Methods

From papers like Advantage-Weighted Regression (Peng et al.) and Accelerating Online Reinforcement Learning with Offline Datasets (Nair et al.). (Both advised by Levine orz).

Implicit policy constraint methods do not add an explicit regularizer term.

Consider, for instance, finding the optimal policy $\pi^{*}$ that maximizes the expectation of a $Q$-function, subject to a reverse KL divergence constraint.

$$
\pi^{*} = \underset{\pi}{\arg\max}\ \mathbb{E}_{\mathbf{a}\sim \pi(\mathbf{a}\mid \mathbf{s})}[Q(\mathbf{s},\mathbf{a})] \quad \text{s.t.}\quad D_{\text{KL}}(\pi \parallel \pi_{\beta})\leq \epsilon
$$

Via classical convex optimization techniques, we can derive an exact, closed-form answer. (Note: this is usually not practical due to the size of the action space).

$$
\pi^{*}(\mathbf{a}\mid \mathbf{s}) = \frac{1}{Z(\mathbf{s})}\pi_{\beta}(\mathbf{a}\mid \mathbf{s})\exp\left( \frac{1}{\lambda}A^{\pi}(\mathbf{s},\mathbf{a}) \right)
$$
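For a single state with a handful of discrete actions, the closed form can be evaluated directly. The following is only a sketch under that assumption; `constrained_optimal_policy` and the example numbers are hypothetical.

```python
import math

def constrained_optimal_policy(behavior, advantages, lam):
    """pi*(a|s) = pi_beta(a|s) * exp(A(s,a) / lam) / Z(s) for one state."""
    # Shift by the largest advantage among supported actions for numerical
    # stability; the shift cancels when we normalize by Z(s).
    a_max = max(a for b, a in zip(behavior, advantages) if b > 0)
    unnorm = [b * math.exp((a - a_max) / lam) if b > 0 else 0.0
              for b, a in zip(behavior, advantages)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

behavior = [0.5, 0.5, 0.0]        # third action never appears in the data
advantages = [-1.0, 1.0, 10.0]    # ...even though its advantage looks large

pi_star = constrained_optimal_policy(behavior, advantages, lam=1.0)
print(pi_star)
```

Actions with zero probability under $\pi_{\beta}$ keep zero probability under $\pi^{*}$, however large their advantage estimate is, and as $\lambda$ grows $\pi^{*}$ approaches $\pi_{\beta}$.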

Now, the idea is that we can use importance sampling to effectively approximate samples from $\pi^{*}$ using samples from $\pi_{\beta}$, and then use behavioral cloning (supervised learning) to learn from the samples of $\pi^{*}$. In particular, we approximate via importance-weighted maximum likelihood, i.e. we produce an update rule

$$
\pi_{\text{new}}(\mathbf{a}\mid \mathbf{s}) = \underset{\pi}{\arg\max}\ \mathbb{E}_{(\mathbf{s},\mathbf{a})\sim \pi_{\beta}}\left[ \log \pi(\mathbf{a}\mid \mathbf{s}) \underbrace{ \frac{1}{Z(\mathbf{s})}\exp\left( \frac{1}{\lambda}A^{\pi_{\text{old}}}(\mathbf{s},\mathbf{a}) \right) }_{ w(\mathbf{s},\mathbf{a}) } \right]
$$

In other words, maximizing the objective

$$
J(\theta) = \mathbb{E}_{(\mathbf{s},\mathbf{a})\sim \pi_{\beta}} \left[\log \pi_{\theta}(\mathbf{a}\mid \mathbf{s})\exp(A^{\pi_{\theta}}(\mathbf{s},\mathbf{a}))\right]
$$

where all terms not dependent on $\theta$ were removed, is equivalent to learning a policy $\pi_{\theta}$ that behaviorally clones the optimal policy $\pi^{*}$ under the reverse KL divergence constraint.

This may translate to an AWAC-like, advantage-weighted actor-critic, algorithm. (Nair et al.)

  1. Evaluate $y_{i}=r(\mathbf{s}_{i},\mathbf{a}_{i})+\gamma \mathbb{E}_{\mathbf{a}'\sim \pi_{\theta}(\mathbf{a}'\mid \mathbf{s}_{i}')}[\hat{Q}_{\phi}^{\pi}(\mathbf{s}_{i}',\mathbf{a}')]$.
  2. Update $\phi$ using $\nabla_{\phi}\sum_{i=1}^{B}\lVert \hat{Q}_{\phi}^{\pi}(\mathbf{s}_{i},\mathbf{a}_{i})-y_{i} \rVert^{2}$.
  3. Update $\psi$ using $\nabla_{\psi}\sum_{i=1}^{B}\lVert \hat{V}_{\psi}^{\pi}(\mathbf{s}_{i})-\mathbb{E}_{\mathbf{a}\sim \pi_{\theta}(\mathbf{a}\mid \mathbf{s}_{i})}[\hat{Q}_{\phi}^{\pi}(\mathbf{s}_{i},\mathbf{a})] \rVert^{2}$. (train $\hat{V}_{\psi}^{\pi}$)
  4. $J(\theta)=\sum_{i}\log \pi_{\theta}(\mathbf{a}_{i}\mid \mathbf{s}_{i})\exp(\hat{Q}_{\phi}^{\pi}(\mathbf{s}_{i},\mathbf{a}_{i})-\hat{V}_{\psi}^{\pi}(\mathbf{s}_{i}))$.
  5. $\theta\leftarrow\theta+\alpha \nabla_{\theta}J(\theta)$.

Note that you can also approximate $\hat{V}^{\pi}(\mathbf{s}_{i})\approx Q_{\phi}^{\pi}(\mathbf{s}_{i},\mathbf{a})$ where $\mathbf{a}\sim \pi_{\theta}(\mathbf{a}\mid \mathbf{s}_{i})$ if you don't want to train a separate neural network to learn $\hat{V}^{\pi}$.
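The advantage-weighted actor objective can be sketched on a batch of precomputed numbers as follows. This is an illustration rather than the AWAC reference implementation: the function names are made up, and the temperature `lam` is an added assumption mirroring the $1/\lambda$ in the closed form (the default `lam=1.0` matches the objective as written in the steps above).

```python
import math

def advantage_weights(q_values, v_values, lam=1.0):
    """exp((Q(s_i, a_i) - V(s_i)) / lam) for each dataset transition."""
    return [math.exp((q - v) / lam) for q, v in zip(q_values, v_values)]

def awac_actor_objective(log_probs, q_values, v_values, lam=1.0):
    """sum_i log pi_theta(a_i|s_i) * exp((Q - V) / lam).

    The weights are treated as constants with respect to theta, so in an
    autodiff framework they would be wrapped in a stop-gradient.
    """
    weights = advantage_weights(q_values, v_values, lam)
    return sum(w * lp for w, lp in zip(weights, log_probs))

# Two dataset actions: one better than average, one worse.
q_batch = [2.0, 0.0]
v_batch = [1.0, 1.0]
log_probs = [math.log(0.5), math.log(0.5)]

weights = advantage_weights(q_batch, v_batch)
objective = awac_actor_objective(log_probs, q_batch, v_batch)
print(weights, objective)
```

Every dataset action still receives a positive weight, so bad actions are merely down-weighted rather than pushed away from.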

This approach is nice for its simplicity; however, it is limited primarily by the fact that, in copying the samples generated by the behavioral policy, it is slow to learn to avoid bad actions.

Implicit $Q$-Learning (IQL)

Paper: Offline Reinforcement Learning with Implicit Q-Learning (Kostrikov et al.). (Also Levine).

Here's a thought: what if we simply avoided all out-of-distribution actions when performing our $Q$-update? (The $Q$-update needs to maximize over all actions.) The intuition is that neural networks can often generalize very well to unseen actions, without explicit consideration of such actions.

Consider performing policy evaluation of the behavior policy. With a typical MSE loss, we might have

$$
\begin{align*}
Q &= \underset{Q}{\arg\min} \sum_{(s,a,s')\in \mathcal{D}}\lVert Q(s,a)-[r(s,a)+\gamma V(s')] \rVert ^{2} \\
V &= \underset{V}{\arg\min} \sum_{(s,a)\in \mathcal{D}}\lVert V(s)-Q(s,a) \rVert ^{2}
\end{align*}
$$

which simply regresses onto the mean of the value of each state/action under the behavior policy. However, we can actually regress onto an upper expectile of these values instead (the squared-loss analogue of an upper quantile), by changing our loss function.

$$
V = \underset{V}{\arg\min}\sum_{(s,a)\in \mathcal{D}}\ell_{\tau}(V(s)-Q(s,a))
$$

where $\ell_{\tau}$ is a function that heavily penalizes $V$ underestimating $Q$, and softly penalizes $V$ overestimating $Q$. It's essentially just MSE with a larger multiplier on all negative values in the domain. This induces our function estimator to perform implicit maximization over the actions in estimating the value function.

Doesn't this cause erroneous overestimation? Wasn't this an issue in soft actor-critic?

Actually, no. In SAC, overestimation was caused by optimism towards the state transitions. Here, the $Q$-function estimator still uses standard MSE to effectively regress onto the state transitions. Only the $V$ estimator now uses the modified loss function to implicitly maximize over the actions, which is precisely what we want for an RL algorithm. It would overestimate had we implemented, say, implicit maximization with just a $Q$-function. (That is, the $Q$-function's use of standard MSE ensures no erroneous overestimation.)

So, implicit $Q$-learning, in practice, uses

$$
\begin{align*}
Q(\mathbf{s},\mathbf{a}) &\leftarrow r(\mathbf{s},\mathbf{a})+\gamma\,\mathbb{E}_{\mathbf{s}'}[V(\mathbf{s}')] \\
V &\leftarrow \underset{V}{\arg\min}\ \frac{1}{N}\sum_{i=1}^{N} \ell(V(\mathbf{s}_{i})-Q(\mathbf{s}_{i},\mathbf{a}_{i})) \\
\ell(x) &= \ell_{2}^{\tau}(x)=\begin{cases} (1-\tau)x^{2}, & x>0 \\ \tau x^{2}, & x\leq 0 \end{cases}
\end{align*}
$$

where $\tau$ is usually near $1$.
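To see the "implicit maximization" effect numerically, here is a small sketch that fits a single scalar $V$ to a handful of $Q$-values under the asymmetric loss, with $x = V - Q$. The fixed-point solver `fit_expectile` is an illustrative stand-in for gradient descent on a network, and it assumes $0 < \tau < 1$.

```python
def expectile_loss(x, tau):
    """Asymmetric squared loss on x = V(s) - Q(s, a)."""
    weight = (1.0 - tau) if x > 0 else tau
    return weight * x * x

def fit_expectile(q_samples, tau, iters=100):
    """Scalar v minimizing sum_i expectile_loss(v - q_i, tau)."""
    v = sum(q_samples) / len(q_samples)  # start from the mean
    for _ in range(iters):
        # Setting the derivative to zero gives a weighted mean whose
        # weights depend on which side of v each sample falls.
        weights = [tau if q > v else (1.0 - tau) for q in q_samples]
        v = sum(w * q for w, q in zip(weights, q_samples)) / sum(weights)
    return v

# Hypothetical Q-values of the dataset actions at one state.
q_samples = [0.0, 1.0, 2.0, 10.0]
v_mean = fit_expectile(q_samples, tau=0.5)   # symmetric loss: plain mean
v_high = fit_expectile(q_samples, tau=0.99)  # close to the largest sample
print(v_mean, v_high)
```

With $\tau = 0.5$ the loss is ordinary MSE and the fit is the mean; as $\tau$ approaches $1$ the fit moves towards the largest $Q$-value among actions that actually appear in the data, never beyond it.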

Note that it is actually possible to show that this is essentially equivalent to $Q$-learning with $Q$-updates that maximize only over actions seen in the sample data of the behavioral policy.

Thus, here's an offline actor-critic algorithm with implicit $Q$-learning.

  1. Evaluate $y_{i}=r(\mathbf{s}_{i},\mathbf{a}_{i})+\gamma \hat{V}_{\psi}(\mathbf{s}_{i}')$.
  2. Update $\phi$ using $\nabla_{\phi}\sum_{i=1}^{B}\lVert \hat{Q}_{\phi}(\mathbf{s}_{i},\mathbf{a}_{i})-y_{i} \rVert^{2}$.
  3. Update $\psi$ using $\nabla_{\psi}\sum_{i=1}^{B}\ell_{2}^{\tau}(\hat{V}_{\psi}(\mathbf{s}_{i})-\hat{Q}_{\phi}(\mathbf{s}_{i},\mathbf{a}_{i}))$.
  4. $J(\theta)=\sum_{i}\log \pi_{\theta}(\mathbf{a}_{i}\mid \mathbf{s}_{i})\exp(\hat{Q}_{\phi}(\mathbf{s}_{i},\mathbf{a}_{i})-\hat{V}_{\psi}(\mathbf{s}_{i}))$.
  5. $\theta\leftarrow\theta+\alpha \nabla_{\theta}J(\theta)$.

Some interesting things to note:

Conservative $Q$-Learning (CQL)

Paper: Conservative Q-Learning for Offline Reinforcement Learning (Kumar et al.). (Yes, also Levine).

Let's now return to the idea of pessimism to solve distributional shift.

Our inspiration is to apply some ideas from adversarial training. Let's redefine our $Q$-function as

$$
\hat{Q}^{\pi}=\arg\min_{Q}\max_{\mu}\ \alpha \mathbb{E}_{\mathbf{s}\sim D,\mathbf{a}\sim \mu(\mathbf{a}\mid \mathbf{s})}[Q(\mathbf{s},\mathbf{a})]+\mathbb{E}_{(\mathbf{s},\mathbf{a},\mathbf{s}')\sim D}[(Q(\mathbf{s},\mathbf{a})-(r(\mathbf{s},\mathbf{a})+\mathbb{E}_{\pi}[Q(\mathbf{s}',\mathbf{a}')]))^{2}]
$$

where the first term "pushes down" on large $Q$-values, and the second term is just our regular $Q$-learning objective. ($\mu$ is our discriminator, the policy $\pi$ is our generator, in GAN terms.) Note that one may show that $\hat{Q}^{\pi}\leq Q^{\pi}$ for large enough $\alpha$, i.e. that our $\hat{Q}^{\pi}$ is a lower bound on the actual $Q$-function.

There's one issue with this solution, though: it will produce a systematic underestimate on actions close to the data. A better estimate is

$$
\begin{align*}
\hat{Q}^{\pi} =\arg\min_{Q}\max_{\mu}\ &\alpha \mathbb{E}_{\mathbf{s}\sim D,\mathbf{a}\sim \mu(\mathbf{a}\mid \mathbf{s})}[Q(\mathbf{s},\mathbf{a})]-\alpha \mathbb{E}_{(\mathbf{s},\mathbf{a})\sim D}[Q(\mathbf{s},\mathbf{a})] \\
&+\mathbb{E}_{(\mathbf{s},\mathbf{a},\mathbf{s}')\sim D}[(Q(\mathbf{s},\mathbf{a})-(r(\mathbf{s},\mathbf{a})+\mathbb{E}_{\pi}[Q(\mathbf{s}',\mathbf{a}')]))^{2}]
\end{align*}
$$

This new term essentially just pushes back up on $(\mathbf{s},\mathbf{a})$ samples in our data, and effectively cancels out with the first term. Notably, we no longer have the guarantee that $\hat{Q}^{\pi}\leq Q^{\pi}$ for all $(\mathbf{s},\mathbf{a})$, but we are guaranteed that $\mathbb{E}_{\pi(\mathbf{a}\mid \mathbf{s})}[\hat{Q}^{\pi}(\mathbf{s},\mathbf{a})]\leq \mathbb{E}_{\pi(\mathbf{a}\mid \mathbf{s})}[Q^{\pi}(\mathbf{s},\mathbf{a})]$ for all $\mathbf{s} \in D$.

Thus, we have a basic conservative $Q$-learning algorithm.

  1. Update $\hat{Q}^{\pi}$ w.r.t. $\mathcal{L}_{\text{CQL}}(\hat{Q}^{\pi})$ using $\mathcal{D}$.
  2. Update policy $\pi$, dependent on discrete/continuous action space.

We've left something out though in our discussion so far: what is this $\mu$ term? Well, it's the distribution that maximizes that inner term; however, without any regularization, it's extremely unstable over the course of training. Typically, we'll add a regularization term $\mathcal{R}(\mu)$ to the loss function, i.e.

$$
\begin{align*}
\hat{Q}^{\pi} =\arg\min_{Q}\max_{\mu}\ &\alpha \mathbb{E}_{\mathbf{s}\sim D,\mathbf{a}\sim \mu(\mathbf{a}\mid \mathbf{s})}[Q(\mathbf{s},\mathbf{a})]-\alpha \mathbb{E}_{(\mathbf{s},\mathbf{a})\sim D}[Q(\mathbf{s},\mathbf{a})]+\mathcal{R}(\mu) \\
&+\mathbb{E}_{(\mathbf{s},\mathbf{a},\mathbf{s}')\sim D}[(Q(\mathbf{s},\mathbf{a})-(r(\mathbf{s},\mathbf{a})+\mathbb{E}_{\pi}[Q(\mathbf{s}',\mathbf{a}')]))^{2}]
\end{align*}
$$

A common choice for $\mathcal{R}(\mu)$ is $\mathbb{E}_{\mathbf{s}\sim D}[\mathcal{H}(\mu(\cdot \mid \mathbf{s}))]$, or maximum entropy regularization. (The regularizer enters with a positive sign because $\mu$ maximizes it; a negative entropy term would push $\mu$ to be even more peaked.) In this case, after learning, $\mu$ becomes proportional to $\exp(Q(\mathbf{s},\mathbf{a}))$. Notably, we can either represent $\mu$ as an actual, explicit learned policy, or we can note that, at the maximizing $\mu$,

$$
\mathbb{E}_{\mathbf{a}\sim \mu(\mathbf{a}\mid \mathbf{s})}[Q(\mathbf{s},\mathbf{a})]+\mathcal{H}(\mu(\cdot \mid \mathbf{s}))=\log \sum_{\mathbf{a}}\exp(Q(\mathbf{s},\mathbf{a}))
$$

For discrete actions, we can calculate this quantity directly, while for continuous actions, we can use importance sampling to estimate this quantity.
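For the discrete-action case, the conservative term can be sketched for one state as below. This assumes the maximum entropy choice of regularizer and drops the $\alpha$ scaling; the names `cql_penalty`, `logsumexp`, and `softmax` are local helpers written for this example, not a library API.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def softmax(values):
    """The max-entropy mu(.|s), proportional to exp(Q(s, .))."""
    lse = logsumexp(values)
    return [math.exp(v - lse) for v in values]

def cql_penalty(q_row, data_action):
    """log sum_a exp Q(s,a) - Q(s, a_data) for one state."""
    return logsumexp(q_row) - q_row[data_action]

q_row = [1.0, 2.0, 3.0]   # hypothetical Q(s, .) over three actions
mu = softmax(q_row)
penalty_seen_best = cql_penalty(q_row, data_action=2)
penalty_seen_worst = cql_penalty(q_row, data_action=0)
print(mu, penalty_seen_best, penalty_seen_worst)
```

The penalty is never negative, and it is larger when the $Q$-function rates actions outside the data more highly than the dataset action, which is exactly the situation the conservative term is meant to discourage.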

What if you didn't use a regularizer?

If you didn't add a regularizer $\mathcal{R}(\mu)$, the maximizing $\mu$ is a distribution that assigns probability $1$ to the action with the highest $Q$-value. Because this may change between every iteration, it is potentially unstable, and therefore slows learning. With maximum entropy regularization, the distribution is encouraged to spread out a bit more over the large $Q$-values.

Offline-to-online RL

Now, let's turn to the problem of offline-to-online RL: using offline RL to pretrain a model, and then using online RL to fine-tune the model.

At the time of writing (March 2026), this is an active area of research. Take all following discussions with a grain of salt.

Okay, so what if we just run an offline RL algorithm, e.g. CQL, and then run a standard online RL algorithm starting from the same point (same policy, same $Q$-function)? Does that just work?

Unfortunately, not quite. See below.

[Figure: offline-to-online.png]

For CQL in particular, which this graph was generated from, during the first phase of online training the model realizes that it has drastically underestimated the $Q$-function for some previously unobserved actions, due to the pessimism of CQL. Other offline RL algorithms have similar, slightly different versions of this same problem.

The problem is the differences between the training phases.

Offline:

Online:

There is, notably, an "embarrassingly effective" method that is not really offline RL at all, yet for a long time outperformed many offline RL methods at making use of offline data during online training. (Efficient Online Reinforcement Learning with Offline Data, Ball et al.)

  1. Initialize two buffers: online replay buffer and offline data buffer
  2. Initialize value function and actor from scratch.
  3. Run online RL; for every batch, sample half from offline buffer and half from replay buffer.
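The batch construction in step 3 might look like the following sketch. The buffers are plain Python lists here, the function name is hypothetical, and sampling is done with replacement; a real implementation would use whatever replay buffer the online RL algorithm already has.

```python
import random

def sample_mixed_batch(offline_buffer, replay_buffer, batch_size, rng=random):
    """Draw half of the batch from each buffer (with replacement).

    Assumes both buffers are non-empty, so the online replay buffer needs
    at least one transition before this is called.
    """
    half = batch_size // 2
    offline_part = [rng.choice(offline_buffer) for _ in range(half)]
    online_part = [rng.choice(replay_buffer) for _ in range(batch_size - half)]
    return offline_part + online_part

# Toy transitions tagged by origin so the split is easy to see.
offline_buffer = [("offline", i) for i in range(100)]
replay_buffer = [("online", i) for i in range(5)]

batch = sample_mixed_batch(offline_buffer, replay_buffer, 8, random.Random(0))
print(batch)
```

Note that the split stays 50/50 even while the replay buffer is still tiny, so early online transitions are heavily up-weighted relative to their share of the total data.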

Obviously, this method is unsatisfying because it involves absolutely zero pretraining. Unfortunately, there isn't a clear solution to this; however, in recent years, algorithms that use diffusion models or flow matching to represent the actor have empirically produced methods with pretraining that do improve over the "embarrassingly effective" method.

It's unknown why exactly they work well, but here is one theory. In the online case, the optimal $\pi_{\theta}(\mathbf{a}\mid \mathbf{s})$ is deterministic; thus, there's no need to capture some multimodal distribution for the policy. In the offline case, capturing only the best mode in the data is fine... but when capturing multiple modes, it may be easier to handle policy constraints. So, for offline-to-online, perhaps it helps to just track all the modes in the offline phase, and then focus on the best mode in the online phase.

However, it's hard to use diffusion as the actor in RL. Optimizing the objective requires either $\nabla_{\theta}\log \pi_{\theta}(\mathbf{a}\mid \mathbf{s})$ (policy gradient) or backpropagation through the diffusion process (reparameterization). Policy gradient is inaccessible for diffusion/flow matching, and backprop is often computationally expensive and unstable (backprop through time, BPTT).

Let's discuss some methods that have used diffusion models, and what they did to solve the above issue.

IDQL

Simple offline-to-online RL with diffusion model. (IDQL: Implicit Q-Learning as an actor-critic method with diffusion policies, Hansen-Estruch et al.)

  1. Train $\hat{Q}_{\phi}(\mathbf{s},\mathbf{a})$ without any actor (e.g. IQL).
  2. Train $\pi_{\theta}(\mathbf{a}\mid \mathbf{s})$ as a diffusion/flow model with behavioral cloning, which leads to $\pi_{\theta}\approx \pi_{\beta}$.
  3. At test time, sample $\{ \mathbf{a}_{1},\dots,\mathbf{a}_{K} \}$ from $\pi_{\theta}(\mathbf{a}\mid \mathbf{s})$, and pick $\arg\max_{\mathbf{a}_{k}}\hat{Q}_{\phi}(\mathbf{s},\mathbf{a}_{k})$.

The point is that no out-of-distribution actions are chosen because $\pi_{\theta}\approx \pi_{\beta}$. This is somewhat reminiscent of stochastic optimization/random shooting from Lecture 16. However, this works surprisingly well, particularly if the data produced by $\pi_{\beta}$ is somewhat decent.
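The test-time selection rule in step 3 is short enough to sketch directly. Here `sample_behavior_action` stands in for drawing one action from the behavior-cloned diffusion/flow policy and `q_function` for the learned critic; both are hypothetical callables supplied by the caller, not part of any IDQL codebase.

```python
def select_action(state, sample_behavior_action, q_function, num_samples=16):
    """Sample K candidate actions from the BC policy, return the best by Q."""
    candidates = [sample_behavior_action(state) for _ in range(num_samples)]
    return max(candidates, key=lambda action: q_function(state, action))

# Toy stand-ins: a "policy" that replays a fixed list of 1-D actions and a
# critic that prefers actions close to 1.0.
fixed_samples = iter([-1.0, 0.5, 2.0, 0.9])

def toy_sampler(state):
    return next(fixed_samples)

def toy_q(state, action):
    return -(action - 1.0) ** 2

best = select_action(state=None, sample_behavior_action=toy_sampler,
                     q_function=toy_q, num_samples=4)
print(best)
```

Since the returned action is always one of the sampled candidates, the critic is only ever queried on actions the behavior-cloned policy would plausibly produce.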

FQL

An actor that stays "close" to diffusion model. (Flow Q-Learning, Park et al.)

  1. Train $\pi_{\text{flow}}(\mathbf{a}\mid \mathbf{s},\mathbf{z})$ as a diffusion/flow model with behavioral cloning, where $\mathbf{z}$ is the input noise in flow matching. This leads to $\pi_{\text{flow}}\approx \pi_{\beta}$.
  2. Run offline actor-critic with a special behavioral cloning regularizer, training the actor $\pi_{\theta}(\mathbf{a}\mid \mathbf{s},\mathbf{z})$ to stay close to $\pi_{\text{flow}}(\mathbf{a}\mid \mathbf{s},\mathbf{z})$, given both $\mathbf{s}$ and $\mathbf{z}$ as input. Notably, the actor is just a regular neural network, not a flow model, that is essentially distilling the behavior of the flow model. In particular, the objective for the actor is
     $$J(\theta) = \sum_{i}\mathbb{E}_{\mathbf{z}\sim p(\mathbf{z})}\big[\mathbb{E}_{\mathbf{a}\sim \pi_{\theta}(\mathbf{a}\mid \mathbf{s}_{i},\mathbf{z})}[\hat{Q}_{\phi}^{\pi}(\mathbf{s}_{i},\mathbf{a})]+\mathbb{E}_{\mathbf{a}\sim \pi_{\text{flow}}(\mathbf{a}\mid \mathbf{s}_{i},\mathbf{z})}[\lambda \log \pi_{\theta}(\mathbf{a}\mid \mathbf{s}_{i},\mathbf{z})]\big]$$

DSRL

Diffusion Steering via Reinforcement Learning. (Steering Your Diffusion Policy with Latent Space Reinforcement Learning, Wagenmaker et al.)

Intuitively, a diffusion model actually produces an action space of only in-distribution actions. So, what if, instead of our actor model producing an action from the general action space, which may be out-of-distribution, we use our actor model to produce a value in the latent space of the diffusion model, i.e. the noise distribution, and then feed that into the diffusion model to produce an in-distribution action? That is, just run an efficient online RL algorithm in the latent space of a diffusion model!

Model-based Offline RL

The critical concern in model-based offline RL is that we'd like to limit the impact of the distributional shift of the environment model itself, because we are unable to collect more data to correct the model's errors. (That is, we are now asking counterfactual questions of the model, rather than of the $Q$-function as before.)

MOPO

MOPO: Model-Based Offline Policy Optimization (Yu et al.)

Also, MOReL: Model-Based Offline Reinforcement Learning (Kidambi et al.)

One solution is to simply adjust the reward function to be pessimistic about OOD states, i.e. to "punish" the policy for exploiting OOD states. The reward is adjusted as follows

$$
\tilde{r}(s,a)=r(s,a)-\lambda u(s,a)
$$

where $u(s,a)$ represents an uncertainty penalty. This can be computed by e.g. measuring disagreement between ensemble models. Subsequently, simply run any existing model-based RL method.
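As one hedged illustration of the penalty, the sketch below takes $u(s,a)$ to be the standard deviation of scalar predictions from an ensemble of learned models. That particular disagreement measure, the scalar predictions, and the name `uncertainty_penalized_reward` are assumptions made for the example rather than the exact estimator used in the MOPO paper.

```python
import statistics

def uncertainty_penalized_reward(reward, ensemble_predictions, lam):
    """r_tilde = r - lam * u, with u = std. dev. across ensemble members."""
    u = statistics.pstdev(ensemble_predictions)
    return reward - lam * u

# In-distribution (s, a): the ensemble members agree, so no penalty.
agree = uncertainty_penalized_reward(1.0, [0.98, 0.98, 0.98], lam=0.5)
# Out-of-distribution (s, a): the members disagree, so the reward drops.
disagree = uncertainty_penalized_reward(1.0, [0.0, 2.0], lam=0.5)
print(agree, disagree)
```

The policy is then trained inside the learned model on the penalized reward, so exploiting regions where the model is unreliable becomes unattractive.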

COMBO

COMBO: Conservative Offline Model-Based Policy Optimization (Yu et al.)

Alternatively, we can leverage the same ideas as in CQL. Similar to how CQL "pushes down" on large $Q$-values, model-based RL can "push down" on the $Q$-values of model state-action tuples. In essence, for a model $p$, we use a $Q$-function update rule of

$$
\hat{Q}^{k+1} \leftarrow \underset{Q}{\arg\min}\ \beta\left(\mathbb{E}_{\mathbf{s},\mathbf{a}\sim p(\mathbf{s},\mathbf{a})}[Q(\mathbf{s},\mathbf{a})]-\mathbb{E}_{\mathbf{s},\mathbf{a}\sim \mathcal{D}}[Q(\mathbf{s},\mathbf{a})]\right)+\frac{1}{2}\mathbb{E}_{\mathbf{s},\mathbf{a},\mathbf{s}'\sim d_{f}}\left[\left(Q(\mathbf{s},\mathbf{a})-\hat{\mathcal{B}}^{\pi}\hat{Q}^{k}(\mathbf{s},\mathbf{a})\right)^{2}\right]
$$

Again, the intuition is the same. It's like a GAN: if the model, the generator, produces something that looks clearly different from real data, the $Q$-function, the discriminator, will assign low values to it.