Lifting the Conceptual Bludgeon Off On and Off Policy Reinforcement Learning
31 Dec 2025

I started adding deadlifting back to my gym routine too!
A Stab at the Definition of On and Off Policy Reinforcement Learning
I wrote this from my heart:
In the RL paradigm, there is an agent that is trained on the data it produces. Some people call this learning from trial and error. Sometimes, the data-generating distribution (which may be produced by the behavior policy, the actor so to speak) is distributionally different from the policy being trained (sometimes called the target policy).
Why is this bad?
There are many reasons why one would train in the on-policy RL paradigm. One of them is stable optimization, meaning that the policy converges to the intended or optimal distribution. This is what importance weighting tries to recover in the off-policy case: it re-weights the empirical data so that the estimate of the expected return, which is part of the optimization target, is computed more accurately, i.e. is an unbiased estimate.
The goodness of on-policyness can even be linked to capacity efficiency, i.e. why approximating with low-rank adapters and doing reinforcement learning on the same dataset achieves the same test NLL, or some other measure of generalization error, as full-capacity SFT.
But isn’t this kind of ambiguous?
Let’s say that you have an inference and training setup in which the inference server generates \(B\) episodes/trajectories, then the trainer collects the data, does a weight update, and broadcasts the new weights back to the inference servers. During the weight update, if the optimization batch size is \(b \ll B\), then essentially off-policy RL is happening: the batch of \(B\) episodes/trajectories is basically a replay buffer.
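To make the replay-buffer effect concrete, here is a minimal sketch (toy policy, made-up rollout data, nothing production-grade): once the first mini-batch triggers an optimizer step, every remaining mini-batch in the buffer was generated by a now-stale policy, so it is consumed off-policy.

```python
import torch

B, b = 1024, 32                      # rollout batch size vs. optimization batch size
policy = torch.nn.Linear(16, 4)      # stand-in for the real policy network
optim = torch.optim.Adam(policy.parameters(), lr=1e-4)

# Pretend these came back from the inference server (states, actions, returns).
states = torch.randn(B, 16)
actions = torch.randint(0, 4, (B,))
returns = torch.randn(B)

for start in range(0, B, b):
    s, a, R = states[start:start + b], actions[start:start + b], returns[start:start + b]

    logp = torch.log_softmax(policy(s), dim=-1).gather(1, a[:, None]).squeeze(1)
    loss = -(logp * R).mean()        # vanilla policy-gradient surrogate

    optim.zero_grad()
    loss.backward()
    optim.step()                     # the policy moves here, so every later
                                     # mini-batch in this buffer is off-policy data
```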
In other words, in the wild, foundation models are usually distributed across GPUs (within a single node or across nodes), so in particular inference runs on different machines than training. There are a bunch of things to optimize in such a setup, but for the wild (the wild on-policy RL!), we’d like to maximize GPU utilization while staying as on-policy as possible (who knows what unit that is measured in? I would really like to know). In the paradigm of in-flight updates, after each optimizer step (optim.step() in your torch or whatever code), the updated weights are broadcast to the inference server. This is really cool. It would be even cooler to customize the step delay between broadcasts. In some sort of limit, this kind of repeated off-policy RL is basically on-policy RL!
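Here is a minimal sketch of that loop, with a toy policy and a placeholder broadcast_weights() (a made-up name) standing in for whatever NCCL/RPC machinery a real system would use; BROADCAST_EVERY is the customizable step delay.

```python
import torch

policy = torch.nn.Linear(16, 4)
optim = torch.optim.Adam(policy.parameters(), lr=1e-4)
BROADCAST_EVERY = 1    # 1 = fully in-flight; larger values trade on-policyness for less communication


def broadcast_weights(state_dict):
    # Placeholder: in a real setup this ships weights to the inference fleet
    # (NCCL broadcast, RPC, an object store, etc.).
    pass


for step in range(100):
    # Stand-in rollout data that would really come from the inference servers.
    s = torch.randn(32, 16)
    a = torch.randint(0, 4, (32,))
    R = torch.randn(32)

    logp = torch.log_softmax(policy(s), dim=-1).gather(1, a[:, None]).squeeze(1)
    loss = -(logp * R).mean()
    optim.zero_grad()
    loss.backward()
    optim.step()

    if step % BROADCAST_EVERY == 0:
        broadcast_weights(policy.state_dict())
```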
Btw, here’s how to do importance weighting
To fix the off-policyness, i.e. when estimating \(\mathbb{E}_{\pi}[R]\) (aka the expected return), where the expectation is taken under \(\pi\) but the data is generated by a different behavior policy \(\mu\), multiply the observed reward / return by the ratio \(\frac{\pi(a|s)}{\mu(a|s)}\). Intuitively, this reweights the implicit \(\mu\) that actually generated the data. In other words, it corrects for the probability term in the expectation:

\[
\mathbb{E}_{\pi}[R] = \sum_{a} \pi(a|s)\, R(s,a) = \sum_{a} \mu(a|s)\, \frac{\pi(a|s)}{\mu(a|s)}\, R(s,a) = \mathbb{E}_{\mu}\!\left[\frac{\pi(a|s)}{\mu(a|s)}\, R\right]
\]

(written per step/action; for whole trajectories the correction becomes a product of per-step ratios).
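And a tiny numerical check of that identity on a made-up 3-armed bandit (all numbers invented for illustration): sampling actions from \(\mu\) and averaging raw rewards gives the wrong answer, while importance weighting recovers \(\mathbb{E}_{\pi}[R]\).

```python
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([0.7, 0.2, 0.1])             # target policy we want to evaluate
mu = np.array([1 / 3, 1 / 3, 1 / 3])       # behavior policy that generated the data
mean_reward = np.array([1.0, 0.0, -1.0])   # true expected reward per arm

# Roll out under the behavior policy mu.
actions = rng.choice(3, size=100_000, p=mu)
rewards = mean_reward[actions] + rng.normal(0, 0.1, size=actions.shape)

# Importance-weighted estimate of E_pi[R].
weights = pi[actions] / mu[actions]

print("true E_pi[R]      :", float(pi @ mean_reward))      # 0.6
print("IS estimate       :", np.mean(weights * rewards))   # ~0.6
print("naive mu average  :", rewards.mean())               # ~0.0, i.e. biased for pi
```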
Some Examples of On and Off Policy in a Real World Scenario, Princess Version
I love Waymo. I think I actually feel safer in Waymos. The other day I scheduled a Rivian test drive because I was SO BORED. Rivian is like an electric version of Range Rover. I love Range Rovers too.
I was thinking through some real world examples of distributional challenges and reinforcement learning!
Let’s think about training a Waymo while holding all other agents constant (I know, BAD!), treating them as part of some factorization of the environment and reward. Yay, partially observable MDPs!
What I really want to get at here is a bunch of different concepts in a real world application, where we want to deploy a safe and accurate policy. Here’s how I imagine training a Waymo would go. Let’s say you have historical data \(D_{H}\) from a human driving a car in the format \(s, a, r, s', \ldots\). You can train an initial policy via behavior cloning / imitation learning / SFT on a base model.
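A minimal sketch of that behavior cloning step, with a toy network and random stand-ins for \(D_{H}\) (none of this is Waymo’s actual setup): fit the policy to the human driver’s \((s, a)\) pairs with a cross-entropy loss, ignoring rewards entirely.

```python
import torch

N_ACTIONS = 5                          # e.g. brake / coast / accelerate / steer left / steer right
policy = torch.nn.Sequential(
    torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, N_ACTIONS)
)
optim = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Stand-in for D_H: states and the actions the human actually took.
states = torch.randn(10_000, 32)
actions = torch.randint(0, N_ACTIONS, (10_000,))

for epoch in range(5):
    loss = torch.nn.functional.cross_entropy(policy(states), actions)
    optim.zero_grad()
    loss.backward()
    optim.step()
```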
Depending on the human driver, this may be a conservative or a risky policy. Waymos deployed in the real world are actually very conservative in the sense that they don’t run red lights, they stop for pedestrians, et cetera. However, we would like to train a policy that learns to act in risky scenarios, more fully covers the \((s, a)\) distribution, and learns via reward / trial and error.
After training an initial policy via behavior cloning, one may tweak the conservativeness of the policy during a simulated test drive, via things like the temperature (if using a softmax policy) or an entropy bonus in the on-policy optimization objective (I would love to know more). This is the stage when there are still humans in the Waymo car (RLHF!), during the on-policy simulated driving training session, for the \((s, a)\) coverage. This is in some sense a risky policy which gets direct human feedback.
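Here is a minimal sketch of those two knobs on a toy softmax policy (sizes and coefficients are made up): the temperature flattens or sharpens the action distribution, and the entropy bonus pushes the optimizer away from overly peaked, overly conservative policies.

```python
import torch

logits = torch.randn(32, 5, requires_grad=True)   # stand-in policy outputs for a batch of states
actions = torch.randint(0, 5, (32,))
returns = torch.randn(32)

TEMPERATURE = 1.5      # > 1 flattens the distribution (riskier, more exploratory)
ENTROPY_COEF = 0.01    # weight of the entropy bonus

dist = torch.distributions.Categorical(logits=logits / TEMPERATURE)
logp = dist.log_prob(actions)

pg_loss = -(logp * returns).mean()       # vanilla policy-gradient surrogate
entropy_bonus = dist.entropy().mean()    # keeps the policy from collapsing onto one action
loss = pg_loss - ENTROPY_COEF * entropy_bonus
loss.backward()
```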
To shift a risky policy back to being more conservative, you could probably train it off-policy on more conservative \(s, a, r, s', \ldots\) data.
More soon.