Lecture 3

Solution 2: Better Models, Fewer Mistakes

Why might a model fail to fit to the expert (training data)? The human expert may exhibit...

Non-Markovian Behavior

Instead of $\pi_{\theta}(\mathbf{a}_{t}\mid \mathbf{o}_{t})$, a human's policy more closely resembles $\pi_{\theta}(\mathbf{a}_{t}\mid \mathbf{o}_{1},\dots,\mathbf{o}_{t})$, in which the current behavior depends on all past observations, not just the current one. This dependence can be captured with a sequence model (e.g. a transformer) that processes each observation "frame" in the history.
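As a minimal sketch of how a history of observation frames could be assembled for such a sequence model, consider the buffer below. The class name `HistoryBuffer` and the fixed window `max_len` are assumptions made for illustration; the notes condition on the full history, whereas a truncated window is a common practical simplification.

```python
from collections import deque


class HistoryBuffer:
    """Keeps the most recent `max_len` observations (hypothetical helper)."""

    def __init__(self, max_len):
        # deque with maxlen silently drops the oldest frame when full
        self.frames = deque(maxlen=max_len)

    def push(self, obs):
        self.frames.append(obs)

    def as_sequence(self):
        # Oldest first, so a sequence model sees the frames in temporal order
        return list(self.frames)


buf = HistoryBuffer(max_len=3)
for o in ["o1", "o2", "o3", "o4"]:
    buf.push(o)
history = buf.as_sequence()
```

The resulting `history` list is what would be fed, frame by frame, to the sequence model in place of the single current observation.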

info

If there is a bijection between observations and states (i.e. $\mathbf{s}_{t}=\mathbf{o}_{t}$), then there exists a Markovian policy (i.e. one that doesn't use history) that is optimal. If the observations are not sufficient to infer the full state, then the optimal policy may require knowledge of history.

However, including history can degrade model performance. Why?

Causal confusion is a particularly notable consequence of an excess of information. In essence, a model's action $A$ causes an event $B$, so $A$ and $B$ show up together very frequently. Because the model learns associations, not causal relationships, the occurrence of $B$ can actually influence the model into doing $A$. This phenomenon is known as spurious correlation.

Brake Indicator

When the car brake is depressed ($A$), the brake indicator lights up ($B$). As the brake remains depressed over the next several time steps, the brake indicator remains lit as well. Then, during inference, if the model sees the brake indicator lit, it is likely to depress the brake because it has associated action $A$ with event $B$.
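The brake example can be made concrete with a toy trajectory. This is only an illustrative sketch: the trajectory, the assumption that the indicator visible to the model reflects the previous step's brake state, and the name `shortcut_policy` are all hypothetical. It shows that a policy that merely copies the indicator matches the expert on most steps, yet never brakes at the moments braking actually begins.

```python
# Hypothetical expert trajectory: 1 = obstacle ahead / brake pressed, 0 = otherwise
obstacle = [0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0]
expert_brake = list(obstacle)          # expert brakes exactly when an obstacle is present
indicator = [0] + expert_brake[:-1]    # assumed: light reflects the previous step's brake


def shortcut_policy(indicator_on):
    # Spurious rule: "brake if the indicator is lit"
    return indicator_on


predictions = [shortcut_policy(b) for b in indicator]
accuracy = sum(p == a for p, a in zip(predictions, expert_brake)) / len(expert_brake)

# Time steps where the expert starts braking
onsets = [
    t for t in range(1, len(expert_brake))
    if expert_brake[t] == 1 and expert_brake[t - 1] == 0
]
onset_predictions = [predictions[t] for t in onsets]
```

On this trajectory the shortcut agrees with the expert on 10 of 14 steps, but every entry of `onset_predictions` is 0, i.e. it fails precisely where the decision matters.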

DAgger?

DAgger, surprisingly, is effective for fixing causal confusion for the simple reason that DAgger addresses distributional shift.

Multimodal Behavior

Multimodal behavior describes environments in which some states admit multiple reasonable actions. This becomes an issue with continuous actions, where a unimodal distribution (e.g. a Gaussian) would essentially average the reasonable actions together and produce an action that may not be reasonable at all! There are a couple of solutions to this.
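A small numerical sketch of the averaging problem, assuming a hypothetical one-dimensional steering action where the expert swerves either left (-1.0) or right (+1.0) at the same state. The maximum-likelihood fit of a single Gaussian is just the sample mean and standard deviation.

```python
import statistics

# Hypothetical demonstrations at one state: half swerve left, half swerve right
expert_actions = [-1.0] * 50 + [1.0] * 50

# Maximum-likelihood single-Gaussian fit
mu = statistics.fmean(expert_actions)
sigma = statistics.pstdev(expert_actions)

# The most likely action under the fitted Gaussian is mu itself
mode_is_demonstrated = mu in expert_actions
```

Here `mu` lands at 0.0 (drive straight ahead), an action the expert never took, so `mode_is_demonstrated` is `False`.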

Discretization

Can we always just naively discretize continuous action spaces? No, because the number of discrete "bins" is exponential in the dimensionality of the action space. For low-dimensional action spaces, however, this is a great solution.
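To give a sense of scale, the snippet below counts the classes needed when every combination of per-dimension bins is its own class. The choice of 10 bins per dimension and the example dimensionalities are assumptions for illustration.

```python
def num_joint_bins(bins_per_dim, action_dim):
    """Classes needed when each combination of per-dimension bins is one class."""
    return bins_per_dim ** action_dim


# 10 bins per dimension for 1-, 2-, and 7-dimensional action spaces
counts = {d: num_joint_bins(10, d) for d in (1, 2, 7)}
```

With these numbers a 2-dimensional action needs 100 classes, while a 7-dimensional one already needs ten million.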

One may suggest discretizing one dimension at a time to avoid the exponential explosion; however, this is not necessarily effective because it implies choosing along each dimension independently when, in fact, the dimensions may covary with each other.

example

For instance, consider a highway with a fast lane on the right and a slow lane on the left. One dimension of the action space is $\{ \text{fast},\text{slow} \}$, and another dimension is $\{ \text{right},\text{left} \}$. However, the two dimensions are not independent: the only good actions here are $(\text{fast},\text{right})$ and $(\text{slow},\text{left})$.
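The highway example can be checked numerically. The sketch below assumes a hypothetical dataset with equal numbers of the two good actions and fits each dimension's marginal separately; the product of marginals then assigns probability to combinations that never appear in the data.

```python
from collections import Counter
from itertools import product

# Hypothetical demonstrations: only the two good combinations occur
demos = [("fast", "right")] * 50 + [("slow", "left")] * 50
n = len(demos)

# Per-dimension (independent) models: just the marginal frequencies
speed_marginal = Counter(speed for speed, _ in demos)
lane_marginal = Counter(lane for _, lane in demos)

# Factorized joint: p(speed, lane) = p(speed) * p(lane)
factorized = {
    (s, l): (speed_marginal[s] / n) * (lane_marginal[l] / n)
    for s, l in product(speed_marginal, lane_marginal)
}

# Probability mass placed on combinations the expert never chose
bad_mass = factorized[("fast", "left")] + factorized[("slow", "right")]
```

Under this setup half of the probability mass (`bad_mass` is 0.5) falls on the two bad combinations.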

Autoregressive discretization provides a computationally efficient (linear, not exponential) solution for high-dimensional action spaces. It discretizes one dimension at a time, but in sequence rather than independently. When predicting dimension $i$ of the action space, the sequence model (e.g. a transformer) is fed the original input (state/observations) and the previously predicted dimensions, i.e. dimensions $0,\dots,i-1$. In short, the dimensions are discretized one at a time and then decoded autoregressively (hence the name).

This is valid by the chain rule of probability. If we denote dimension $i$ of action $\mathbf{a}_{t}$ as $a_{t,i}$, then

$$p(\mathbf{a}_{t}\mid \mathbf{s}_{t})=p(a_{t,0},\dots ,a_{t,n}\mid \mathbf{s}_{t})=\prod_{i=0}^{n} p(a_{t,i}\mid \mathbf{s}_{t},a_{t,0},\dots ,a_{t,i-1})$$

In other words, this is equivalent to predicting all dimensions of the action at once based on the state while still providing an efficient method of discretization.
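A minimal sketch of autoregressive decoding on the two-dimensional highway example, with hand-written probability tables standing in for a learned sequence model (the tables and function names are assumptions). The second dimension is sampled conditioned on the first, so the joint probability is the product of conditionals, as in the chain rule.

```python
import random

# Stand-ins for the model's outputs at a fixed state (hypothetical values)
p_speed = {"fast": 0.5, "slow": 0.5}            # p(a_0 | s)
p_lane_given_speed = {                           # p(a_1 | s, a_0)
    "fast": {"right": 1.0, "left": 0.0},
    "slow": {"right": 0.0, "left": 1.0},
}


def sample_from(dist, rng):
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def sample_action(rng):
    # Decode one dimension at a time, feeding earlier choices forward
    speed = sample_from(p_speed, rng)
    lane = sample_from(p_lane_given_speed[speed], rng)
    return speed, lane


def joint_prob(speed, lane):
    # Chain rule: p(a_0, a_1 | s) = p(a_0 | s) * p(a_1 | s, a_0)
    return p_speed[speed] * p_lane_given_speed[speed][lane]


rng = random.Random(0)
samples = [sample_action(rng) for _ in range(1000)]
```

Unlike independent per-dimension discretization, every sample here is one of the two good combinations, and `joint_prob("fast", "left")` is exactly zero.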

Expressive Continuous Distributions

Generally speaking, it's hard to model multimodal distributions. Instead, we can include an additional input that "decides" the mode that the model chooses; this additional input can simply be some random noise.

The main challenge with this method, however, is that we must train the model to actually use the random noise to decide the mode—otherwise, of course, the noise is useless. There are several different solutions for this.

Diffusion/Flow Matching

Diffusion is the most popular approach for generating high-dimensional, continuous data. The key principle underlying diffusion is that turning random noise into a specific image (e.g. an image of a dog) is hard, but turning an image of a dog into random noise is easy (just add more noise). So, to train a model to generate images of dogs, we take real images of dogs, iteratively add noise, and then train the model to simply go the other way!

diffusion.png Source: GeeksforGeeks

Flow matching is the modern version of diffusion, with essentially the same intuition, except it's easier to implement. In essence, we want to train a model to start with a noise distribution, e.g. a Gaussian, and transform samples from it into a desired data distribution. It relies on the same idea of starting with real data samples and adding noise, and simply training the model to reverse this process. Formally, flow matching learns a vector field $v(\mathbf{x}_{t},t)$ that allows "sampling from" (modeling) $p(\mathbf{x})$ by sampling $\mathbf{x}_{0}\sim p_{0}(\mathbf{x}_{0})$ (the noise distribution) and integrating the vector field to produce $\mathbf{x}_{1}=\mathbf{x}_{0}+\int_{0}^{1} v(\mathbf{x}_{t},t) \, \mathrm{d}t$. In practice, this integral is approximated with Euler integration. Below is a demonstration of flow matching in action.

flow_matching.mp4 Source
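The Euler discretization of the integral above can be sketched as follows. The function `euler_integrate` and the hand-written `toy_field` are assumptions for illustration: `toy_field` is the straight-line velocity toward a single hypothetical target point, standing in for a learned vector field.

```python
def euler_integrate(v, x0, num_steps):
    """Approximate x1 = x0 + integral of v(x_t, t) over [0, 1] with forward Euler."""
    x = x0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                 # t runs over 0, dt, ..., 1 - dt
        x = x + dt * v(x, t)       # one Euler step along the field
    return x


TARGET = 2.0  # hypothetical single data point


def toy_field(x, t):
    # Velocity that carries any x at time t to TARGET by time 1
    return (TARGET - x) / (1.0 - t)


x1 = euler_integrate(toy_field, x0=-0.7, num_steps=10)
```

Because the field is never evaluated at `t = 1`, the division is safe, and for this particular field the final Euler step lands on `TARGET` up to floating-point error.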

But how do we train flow matching?

  1. Sample $\mathbf{x}_{0}\sim \mathcal{N}(0,\mathbf{I})$ (i.e. the Gaussian noise distribution)
  2. Sample $\mathbf{x}_{1}\in \mathcal{D}$ where $\mathcal{D}=\{ \mathbf{x}^{(i)} \}_{i=1}^{N}$ (the data distribution)
  3. Sample $t\sim p(t)$ where $p(t)$ is defined only over $[0,1]$ (e.g. $p(t)=\mathcal{U}(0,1)$). This selects the "time step" $t$ of the vector field between $t=0$, the noise distribution, and $t=1$, the data distribution. See the diagram below for clarification.
  4. Compute $\mathbf{x}_{t}=t\mathbf{x}_{1}+(1-t)\mathbf{x}_{0}$ (i.e. just draw a line)
  5. Update $v$ with a gradient step on $\lVert v(\mathbf{x}_{t},t)-(\mathbf{x}_{1}-\mathbf{x}_{0}) \rVert^{2}$, i.e. nudge the velocity at that point in the vector field to point more towards $\mathbf{x}_{1}$ (the intended velocity is the slope of the line, $\mathbf{x}_{1}-\mathbf{x}_{0}$)

flow-matching-training.png
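The five training steps above can be sketched in one dimension. Everything specific here is an assumption made to keep the example dependency-free: the bimodal dataset `DATA`, the grid sizes, the learning rate, and above all the tabular "model" (one velocity value per `(x, t)` cell) used in place of a neural network.

```python
import random

rng = random.Random(0)
DATA = [-1.0, 1.0]              # hypothetical bimodal dataset D
X_MIN, X_MAX = -3.0, 3.0
NX, NT = 30, 20                 # grid resolution of the tabular field
LR = 0.05

# Tabular stand-in for v(x, t): one velocity per grid cell, initialised to zero
table = [[0.0] * NT for _ in range(NX)]


def cell(x, t):
    x = min(max(x, X_MIN), X_MAX - 1e-9)          # clip x into the grid
    ix = int((x - X_MIN) / (X_MAX - X_MIN) * NX)
    it = min(int(t * NT), NT - 1)
    return ix, it


def v(x, t):
    ix, it = cell(x, t)
    return table[ix][it]


def draw():
    x0 = rng.gauss(0.0, 1.0)                      # step 1: noise sample
    x1 = rng.choice(DATA)                         # step 2: data sample
    t = rng.random()                              # step 3: t ~ U(0, 1)
    xt = t * x1 + (1.0 - t) * x0                  # step 4: point on the line
    return xt, t, x1 - x0                         # target velocity x1 - x0


def mean_loss(n=5000):
    total = 0.0
    for _ in range(n):
        xt, t, target = draw()
        total += (v(xt, t) - target) ** 2
    return total / n


initial_loss = mean_loss()
for _ in range(50000):
    xt, t, target = draw()
    ix, it = cell(xt, t)
    # step 5: gradient of (v - target)^2 w.r.t. the cell value is 2 (v - target)
    table[ix][it] -= LR * 2.0 * (table[ix][it] - target)
trained_loss = mean_loss()


def sample(num_steps=NT):
    x = rng.gauss(0.0, 1.0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        # Euler step, evaluating the field at the centre of each time bin
        x += dt * v(x, (k + 0.5) * dt)
    return x


samples = [sample() for _ in range(200)]
```

After training, `trained_loss` should be lower than `initial_loss`, and `sample()` integrates the learned field from noise. A table is used only because its gradient step is a one-liner; a neural network would replace `table` and the manual update in practice.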

The point of this with regard to fixing the averaging problem is that the vector field produced by flow matching eventually branches toward one of the reasonable actions. In earlier "time" steps ($t$ closer to $0$), the vector field appears more averaged across reasonable actions. When $t$ is closer to $1$, the vector field starts to diverge toward the individual modes. In essence, supervising on many straight-line velocities produces an overall vector field that maps the noise distribution to the target distribution, and can therefore represent multimodality.

In RL, flow matching is applied as follows.

  1. Construct minibatch. For each element $j$ in the batch,
    1. Sample $(\mathbf{o}_{t}^{(j)},\mathbf{a}_{t}^{(j)})$ from the dataset (data distribution)
    2. Sample $\mathbf{a}_{t,0}^{(j)}\sim \mathcal{N}(0,\mathbf{I})$ (noise distribution)
    3. Sample $\tau^{(j)}\sim p(\tau)$ (time step)
    4. Compute $\mathbf{a}_{t,\tau}^{(j)}=\tau^{(j)}\mathbf{a}_{t}^{(j)}+(1-\tau^{(j)})\mathbf{a}_{t,0}^{(j)}$.
  2. Update $\theta\leftarrow\theta-\alpha \nabla_{\theta}\mathcal{L}$, where $\mathcal{L}=\sum_{j=1}^{B}\lVert v_{\theta}(\mathbf{o}_{t}^{(j)},\mathbf{a}_{t,\tau}^{(j)},\tau^{(j)})-(\mathbf{a}_{t}^{(j)}-\mathbf{a}_{t,0}^{(j)}) \rVert^{2}$.

That is, when applying flow matching to RL, the model is predicting the vector field/velocity itself. In particular, the model is represented by $v_{\theta}(\mathbf{o}_{t}^{(j)},\mathbf{a}_{t,\tau}^{(j)},\tau^{(j)})$.
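The minibatch construction and parameter update can be sketched with scalar observations and actions. The dataset, the linear form of `v_theta`, the feature vector, and the step size `ALPHA` are all assumptions chosen so the gradient can be written by hand; a real policy would be a neural network trained with automatic differentiation.

```python
import random

rng = random.Random(0)
# Hypothetical demonstrations (o_t, a_t): scalar observation, scalar action
dataset = [(rng.uniform(-1.0, 1.0), rng.choice([-1.0, 1.0])) for _ in range(256)]


def features(o, a_tau, tau):
    return [o, a_tau, tau, 1.0]


def v_theta(theta, o, a_tau, tau):
    # Linear stand-in for the velocity network v_theta(o, a_tau, tau)
    return sum(w * f for w, f in zip(theta, features(o, a_tau, tau)))


def make_minibatch(batch_size):
    batch = []
    for _ in range(batch_size):
        o, a = rng.choice(dataset)               # data sample
        a0 = rng.gauss(0.0, 1.0)                 # noise sample
        tau = rng.random()                       # flow time step
        a_tau = tau * a + (1.0 - tau) * a0       # interpolated action
        batch.append((o, a_tau, tau, a - a0))    # target velocity a - a0
    return batch


def loss_and_grad(theta, batch):
    loss, grad = 0.0, [0.0] * len(theta)
    for o, a_tau, tau, target in batch:
        residual = v_theta(theta, o, a_tau, tau) - target
        loss += residual ** 2
        for i, f in enumerate(features(o, a_tau, tau)):
            grad[i] += 2.0 * residual * f
    return loss, grad


theta = [0.0] * 4
batch = make_minibatch(32)
loss_before, grad = loss_and_grad(theta, batch)
ALPHA = 1e-4
theta = [w - ALPHA * g for w, g in zip(theta, grad)]   # gradient descent step
loss_after, _ = loss_and_grad(theta, batch)
```

The observation enters only as an extra input to `v_theta`; the interpolation and the velocity target are computed purely in action space.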

Action Chunking

Action chunking asks a model to predict a chunk or sequence of the next $K$ actions, after which the actor will execute the first $k$ actions, where $1<k\leq K$. It's unknown exactly why or when this improves performance for imitation learning, but it frequently does!

Why only execute $k$ actions?

There is no definitive reason, but it seems that it provides additional training signal for the model.
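A minimal sketch of the execution loop for action chunking: predict a chunk, execute its first few actions, then re-plan from the new observation. The function names, the toy integer environment, and the values of the chunk length and execution length are assumptions for illustration.

```python
def run_with_action_chunking(policy, env_step, obs, horizon, k):
    """Predict a chunk of actions, execute the first k, then re-plan."""
    executed = []
    while len(executed) < horizon:
        chunk = policy(obs)                  # K predicted actions
        for action in chunk[:k]:             # execute only the first k
            obs = env_step(obs, action)
            executed.append(action)
            if len(executed) == horizon:
                break
    return executed, obs


policy_calls = []


def toy_policy(obs, chunk_len=4):
    # Hypothetical policy: always predicts "move +1" for the next 4 steps
    policy_calls.append(obs)
    return [1] * chunk_len


def toy_env_step(obs, action):
    # Hypothetical environment: position on the integer line
    return obs + action


executed, final_obs = run_with_action_chunking(
    toy_policy, toy_env_step, obs=0, horizon=10, k=2
)
```

With a chunk of 4 and 2 executed actions per chunk, the policy is queried once every two environment steps, so 10 steps take 5 policy calls.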

Solution 3: Narrow vs Broad Data

In practice, how you collect/augment your datasets has a substantial impact on model performance.

Mistakes and Corrections

One common method is to intentionally add mistakes (and their corresponding corrections) to the dataset. This keeps the dataset from being "too good," so the model can learn to recover from mistakes; the idea is that the corrections help more than the mistakes hurt. The added data may even be synthetic.

Pre-Training

Essentially, the motivation is the same as for mistake augmentation: we want to show the model bad situations and how to recover from them, without teaching it to enter those situations. So, we run two steps of training.

  1. Pre-training phase: train the model on a very large but low-quality dataset that may include many mistakes.
  2. Post-training phase: train the model on a smaller but curated, high-quality dataset that only includes good examples (this is known as fine-tuning).

Solution 4: Multi-task Learning

Consider training a car to drive to a single point $\mathbf{p}_{1}$, with policy $\pi_{\theta}(\mathbf{a}\mid \mathbf{s})$. An example of a multi-task version of this problem would be training a car to drive to any of a set of points $\{ \mathbf{p}_{1},\dots,\mathbf{p}_{n} \}$, with policy $\pi_{\theta}(\mathbf{a}\mid \mathbf{s},\mathbf{p})$, i.e. conditioned on some choice of $\mathbf{p}\in \{ \mathbf{p}_{1},\dots,\mathbf{p}_{n} \}$.

The most obvious benefit is that this allows a much larger and more varied corpus of training data. This is more formally known as goal-conditioned behavioral cloning: the model is trained on a dataset spanning several tasks, and it takes the desired goal as an additional input (in this case, the point $\mathbf{p}$ the car should drive to).
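One simple way to feed the goal to the policy is to concatenate it to the state, so demonstrations for different goals can be pooled into one training set. The sketch below assumes that approach; the function name, the tuple representation of states and goals, and the toy demonstrations are hypothetical.

```python
def make_goal_conditioned_dataset(demos_by_goal):
    """Pool per-goal demonstrations into ((state, goal), action) training pairs."""
    dataset = []
    for goal, trajectory in demos_by_goal.items():
        for state, action in trajectory:
            # Policy input is the state concatenated with the goal point
            dataset.append(((*state, *goal), action))
    return dataset


# Hypothetical demonstrations: same start state, different goal points
demos_by_goal = {
    (1.0, 0.0): [((0.0, 0.0), "east")],
    (0.0, 1.0): [((0.0, 0.0), "north")],
}
dataset = make_goal_conditioned_dataset(demos_by_goal)
```

The two pooled examples share the same state but differ in the goal, which is what lets a single policy produce different actions for different tasks.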