stochastic parrots of post raining 🌧️ ☔️ post training reinforcement learning reasoning

What sparked my curiosity? The long reflection tokens from the m1 paper, which sent me reeling in a chat with my friend. Then I read another paper, the Qwen3 report, where they talked about thinking and non-thinking mode fusion, and I was like, is that really that difficult? It seems you can mix and match any sequence of domain-specific foundation model rollouts by appending transition tokens, which functionally are similar to conditional logic (and, or, where, etc.), fine-tune on those rollouts, rinse and repeat.

The long reflection tokens from m1, ‘However, Recheck, Wait, Aha’, were apparently very important tokens in the reasoning paths for stabilizing the entropy of the learned policy, which is apparently important for downstream reasoning RL performance. Because these tokens have low weight under \(\pi_{ref}\), the denominator of the IS ratio, they get a high \(\dfrac{\pi_{cur}}{\pi_{ref}}\) multiplying the advantage (\(\sum_{i=1}^t r_i - V\)), and so they were clipped out of the PPO (and GSPO and GRPO) updates. To preserve them, the CISPO objective from the m1 paper simply adds a stop-gradient operation, so the high IS term \(\dfrac{\pi_{cur}}{\pi_{ref}}\) does not explode the gradient update in the chain rule of backpropagation, and the tokens (‘However, Recheck, Wait, Aha’) are still assigned credit in the update to increase their log_proba.

PPO Objective Function Minus the KL Term

\(J(\theta) = \mathbb{E}[\frac{1}{|o_i|} \sum_{t=1}^{|o_i|} min(r_{i,t}A_{i,t}, clip(r_{i,t}, 1-\epsilon, 1+\epsilon)A_{i,t}) ]\)

GRPO Objective function

\[J(\theta) = \mathbb{E}[\frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} min(r_{i,t}A_{i,t}, clip(r_{i,t}, 1-\epsilon, 1+\epsilon)A_{i,t}) ]\]

CISPO Objective function

\(J(\theta) = \mathbb{E}\left[\frac{1}{\sum_{i=1}^{G} |o_i|} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} sg(r_{i,t})A_{i,t} \log \pi_{\theta}(o_{i,t} \mid q, o_{i,\text{prev}}) M_{i,t} \right]\)

where \(M_{i,t} = 0\) if \(A_{i,t} > 0\) and \(r_{i,t} > 1 + \epsilon_{high}\), \(M_{i,t} = 0\) if \(A_{i,t} < 0\) and \(r_{i,t} < 1 - \epsilon_{low}\), and \(M_{i,t} = 1\) otherwise.

The main insight from PPO to GRPO, the “group” part, is that one may approximate the baseline term \(V\) in the advantage computation \(A = R - V\) with the average summed return of the group of rollouts, the many completions for a fixed prompt (the Prover, Verifier pairs).
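To make the group baseline concrete, here is a minimal sketch (the names and the std normalization are my assumptions; the GRPO paper also divides by the group standard deviation, which the sentence above leaves out):

```python
import torch

def group_relative_advantages(returns, normalize_std=True, eps=1e-8):
    """returns: shape (G,), one summed reward per rollout of the same prompt."""
    baseline = returns.mean()            # V approximated by the group mean
    adv = returns - baseline             # A_i = R_i - mean(R)
    if normalize_std:                    # GRPO additionally divides by the group std
        adv = adv / (returns.std() + eps)
    return adv                           # broadcast over each rollout's tokens
```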

The main insight from GRPO to CISPO is that transforming the clipping of \(r_{i,t}A_{i,t}\) into a mask lets one inspect \(r_{i,t}\) and \(A_{i,t}\) together and zero a token out only when the two jointly overflow or underflow the trust region. I assume the ‘Aha’, ‘Wait!’ tokens have consistently high \(A_{i,t}\) but decreasing \(r_{i,t}\) as policy learning weights them higher over time.

I shouted to S.Z., why was the entire section about the long reflection tokens in the m1 paper necessary??? I wonder whether, if one studied the long reflection tokens from the m1 paper in Euclidean space after post-training, they would belong to the same subspace. Why am I always becoming an interpretability girlie, against my better judgment?! She said, I wonder about the false positives for the reflection tokens? I said, I can imagine some trigger words. (…) Who would have thought that one could simply inspect conditional probabilities?

The KL term, these days, is left out of most reasoning model objective functions, as policies deviate wildly from the reference policy anyway. I shouted to A.L., the rollout completion length penalty in section 2.2.3 of the Magistral paper makes no sense! Why not just append a STOP_THINKING token after correct reasoning traces and move this upstream to the Long CoT cold-start behavior imitation stage of the post raining post training reasoning RL pipeline? I finally figured out how to do Las Vegas algorithms with foundation models, something I had been thinking about forever. Computability theory, my OG.

Anyway, the point of the sections above was to point out the similarity between the fork-in-the-road reflection tokens in the reasoning rollouts and the thinking mode fusion from Qwen3, at different layers of abstraction.

Entropy in the RL Reasoning Phase

Papers such as this one talk about the importance of the entropy of the optimized policy for downstream performance, fitting the equation \(R = -a \cdot e^{H} + b\) (\(H\) is always between 0 and 1, so \(e^{H}\) is a convex curve inverted by the \(-a\)). The downstream evaluations are datasets such as OMNI-BENCH, AIME 2024, et cetera. (Recall that foundation models expose a log_proba or proba function which allows you to compute \(H = -\sum p \log p\).)
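As a sketch of that parenthetical, assuming one has the policy's logits at each generated position (the `logits` tensor below is hypothetical, not any particular framework's API), the average per-token entropy falls out like this:

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits):
    """logits: (T, vocab) policy logits at each generated position.

    H_t = -sum_v p_t(v) log p_t(v); returns the mean over the T tokens.
    """
    logp = F.log_softmax(logits, dim=-1)
    p = logp.exp()
    return -(p * logp).sum(dim=-1).mean()
```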

There are several ways to optimize for the entropy of a reasoning RL policy:

  1. Entropy Bonus in the objective function

I imagine this is adding a \(+\beta H(\pi_\theta)\) term to the objective function (equivalently, \(-\beta H\) to the loss being minimized).

  2. \(\epsilon_{high}\) in the clip function of a GRPO objective

This is what Magistral does; they say that the entropy bonus of method 1 causes instability. The basic modification to the GRPO objective is to adjust the clipping thresholds.

\[J(\theta) = \mathbb{E}[\frac{1}{|o_i|} \sum_{t=1}^{|o_i|} min(r_{i,t}A_{i,t}, clip(r_{i,t}, 1-\epsilon_{low}, 1+\epsilon_{high})A_{i,t}) ]\]

\(\epsilon_{high}\) allows \(\pi_{cur}\) to deviate from \(\pi_{ref}\), much as adjusting the \(\beta\) term on a KL penalty would.

  3. Converting the clip into a stop_gradient and masking in the CISPO objective

The stop_gradient is implemented by returning None in the backward pass of an autodifferentiation graph, treating the variable like a constant. The difference with clipping is that the high IS term \(r_{i,t}\) remains part of the loss computation and accordingly weights the fork-in-the-road tokens, which contribute high entropy (exploration), in the loss function. I’m not sure what the m1 paper means by a token being ‘clipped out’; my guess is that the token’s log_proba needs to be high enough relative to the non-exploration tokens. I’m curious about this.
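Here is a minimal sketch of how I read the CISPO update, with `detach()` playing the role of \(sg(\cdot)\) and the mask zeroing a token only when the ratio and the sign of the advantage jointly leave the trust region; the function name, thresholds, and shapes are my assumptions, not the m1 paper's code:

```python
import torch

def cispo_token_loss(logp_cur, logp_old, advantages, eps_high=0.2, eps_low=0.2):
    """logp_cur requires grad; logp_old and advantages are treated as constants."""
    ratio = torch.exp(logp_cur.detach() - logp_old)   # sg(r_t): no gradient through the IS weight
    mask = torch.ones_like(ratio)
    mask[(advantages > 0) & (ratio > 1 + eps_high)] = 0.0
    mask[(advantages < 0) & (ratio < 1 - eps_low)] = 0.0
    # gradient flows only through log pi_theta, so high-IS reflection tokens still get credit
    return -(ratio * advantages * logp_cur * mask).mean()
```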

I wonder, how much entropy is too much entropy? I wish the entropy mechanism paper had just reported an exact value of \(H\), if it exists, for the downstream tasks. I suppose the predictive equation is more flexible as tasks become ever more computationally difficult.

KL Distillation

They took the KL term out of the RL phase of the pipeline (recall that the m1 and Magistral papers, and probably a few others, do not have a KL term in the RL objective)! But KL distillation, ah, the good teacher and professor forcing (so to speak, those are the titles of the papers!), can be used to train lightweight smaller models on rollouts from the heavyweight models (trained through the pretraining -> Long CoT cold start -> reasoning RL -> thinking mode fusion -> general RL pipeline). It seems inefficient that one must have log_proba access to the heavyweight model's logits to compute this update, i.e. to do logit distillation.
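A sketch of logit distillation as I understand it, a forward KL from the teacher's token distribution to the student's over the teacher's rollout positions; the temperature and names are assumptions:

```python
import torch
import torch.nn.functional as F

def distill_kl_loss(student_logits, teacher_logits, T=1.0):
    """Both logits: (num_tokens, vocab) at the same rollout positions.

    KL(p_teacher || p_student) averaged over positions; needs the teacher's
    full logits, which is the access cost noted above.
    """
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    logp_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(logp_student, p_teacher, reduction="batchmean") * (T * T)
```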

Thank you to SZ for amplifying my inspiration with random words like rollouts and transformers, and for the blog in which I learned the term logit distillation, and to RV for the open-weight papers.

Cultural Contrastive Learning

After a long hiatus from this blog, I’m writing about a three-week experiment I did to express a proof-of-concept methodology for finetuning multimodal embeddings to imitate a human’s engram when one has access to their social media platforms, building on ideas from contrastive learning, multimodal foundation models, philosophy, and media theory. The full text can be found here.

Modality gap in representation learning is a well-studied problem. While many have definitions of it in measurable benchmarks, most conceptualizations and solutions of the problem focus on style and lower-order concrete semantics. We elucidate a new perspective and class of problems in the space of multimodal representation learning, especially as it pertains to personalization, provide a proof of concept of finetuning a representation space for this problem, and discuss its applications in various generative AI pipelines.

Qualia refers to the subjective, qualitative, felt experiences of an individual’s conscious experience. Examples include the feeling of pain, the taste of coffee, or the color red as you, the individual, perceive it. Qualia is conditional on an individual’s neural architecture, so to speak, the experiences they collect through their life, and the particular environments they are embedded in (in the sense of other agents being part of an environment, in the sense of the Sapir-Whorf hypothesis, in the sense of Wittgenstein, and so on). (The Sapir-Whorf hypothesis suggests that language influences thought. This connects to Wittgenstein’s concept of language games, where meaning emerges from use within specific contexts.)

In culture (the qualia of a collective group of individuals), such as literature and art, qualia can be described formally as synaesthesia (sensory crossover between modalities), aesthetic affinity (a form of emotional kinship), or semiotics (shared symbolic languages). Other, informal words for this might be resonance, evocation, or zeitgeist convergence (shared cultural-moment expression).

The concept of a modality gap was first introduced in Mind the Gap (Liang et al., 2022), which posits that the geometric inductive bias introduced in multimodal embeddings, in which unimodal domains are tokenized and embedded separately, creates a modality gap on image and image-caption distributions.
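Roughly, that geometric gap can be sketched as the distance between the centroids of the L2-normalized image and text embeddings; `img_emb` and `txt_emb` below are hypothetical precomputed arrays, and this is only an approximation of the paper's measurement:

```python
import numpy as np

def modality_gap(img_emb, txt_emb):
    """img_emb, txt_emb: (N, d) embeddings of paired images and captions."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    # length of the vector separating the two modality centroids
    return np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0))
```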

A searchable continuous latent space which solves the modality gap lends itself to a multimodal embedding as well as a latent space for user-conditional multimodal generation. We believe that this is an approximation to understanding the phenomenal binding problem, which is about how objects, background objects, as well as abstract and affective features are integrated into a unified experience for an individual.

A motivating application of customizing a CLIP-like latent space is its use in custom text-conditional diffusion pipelines. A custom latent space approach could be complementary to fine-tuning diffusion weights, which focuses more on direct style-transfer-like results rather than semantic understanding.

Thus far, the representation learning research community has focused on multimodal distributions of (text, image) pairs which are relatively straightforward in their translation. For example, the Mind the Gap paper evaluates models for their geometric gap on the COCO dataset, which contains photos of generic objects (in the same sense that DALL-E 1 was pre-trained on (image, text) pairs for which the text appeared in Wikipedia \(> 100\) times), and Voyage evaluates mixed-modality search on the distribution (text, image of text), i.e. that the string “Hello world” retrieves an image of “Hello world” rather than a string such as “Cat.”

While these evaluations form a baseline of the multimodal representation gap, they still represent relatively simple cross-domain transformations. For example, the transformation from image to image caption focuses on the object level, which most if not all observers of the image would agree on. And the transformation from text to image of text is as simple as saving a PDF with “text” in it. In some sense, this means that these joint distributions have higher mutual information and are easier to learn than an individual’s sensory space.

An individual’s sensory space, on the other hand, is shaped by their history, experiences, and unique biology. Digitally, it is traceable, for example, through a user’s hypertextual space, intentional navigation through the web, manual linking, et cetera. When we learn a custom user adapter downstream of a pre-trained baseline multimodal embedding model, we are, in some sense, learning this transformation.

The dataset we explore with is a second-degree scrape of a test user’s text-based and image-based semantic space. This is chosen as a joint distribution which is representative of the test user’s qualia (perhaps, in a cultural subspace).

We choose to join Substack and Pinterest space as they represent intentional content discovery platforms in which users explore an inner space such as aesthetics for home design, visualizing futures, or introspective writing with emotional qualities. In particular, these are two domains in which we expect a high frequency of cross-artefact association driven by the user’s intuitive style, or rhizomatic thinking.

In particular, this user-qualia-representative dataset is one in which various forms of a modality gap appear with embedding spaces such as voyage-multimodal-3.
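As a sketch of the kind of user adapter described above, one could train a small projection head with a symmetric InfoNCE loss over the user's paired (text, image) embeddings from a frozen base model such as voyage-multimodal-3; the architecture and hyperparameters here are illustrative assumptions, not the exact setup from the full text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UserAdapter(nn.Module):
    """Small residual MLP applied on top of frozen base embeddings."""
    def __init__(self, dim=1024, hidden=512):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return F.normalize(x + self.proj(x), dim=-1)

def info_nce(text_emb, image_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired user artefacts."""
    logits = text_emb @ image_emb.T / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

# usage sketch: base_text, base_image are precomputed (N, dim) frozen embeddings
# adapter = UserAdapter(); loss = info_nce(adapter(base_text), adapter(base_image))
```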

Proof of sublinear regret of UCB algorithms for bandits

This post will be concerned with the proof that the UCB algorithm achieves sublinear regret for multi-armed bandits. First, let’s set up the preliminaries of the problem.

UCB and bandits are part of fast reinforcement learning, a problem setting concerned with making sample-efficient decisions, especially in applications where experience is costly or difficult to acquire. Bandits are a simplified version of an MDP, in particular one with only one state. We have a finite set of actions \(\mathcal{A}\), each of which induces a reward distribution \(\mathcal{R}^{a}(r) = P[r \vert a]\). At each step, the agent selects some \(a_{t} \in \mathcal{A}\) and the overall goal is to maximize the cumulative reward, \(\sum_{t=1}^T r_{t}\), or equivalently, to minimize the cumulative regret, \(l_{T} = \mathbb{E}[\sum_{t=1}^T (V^{*} - Q(a_{t}))]\).
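For concreteness, a sketch of UCB1 over a finite set of arms; the Bernoulli arms in the usage comment are stand-ins, and the exploration constant is a common default rather than anything tied to the proof:

```python
import math
import random

def ucb1(pull, n_arms, T, c=2.0):
    """pull(a) -> reward in [0, 1]; returns the empirical means after T steps."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, T + 1):
        if t <= n_arms:
            a = t - 1                                   # play each arm once first
        else:
            a = max(range(n_arms),
                    key=lambda i: means[i] + math.sqrt(c * math.log(t) / counts[i]))
        r = pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]          # incremental mean update
    return means

# usage sketch with Bernoulli arms
# probs = [0.3, 0.5, 0.7]
# means = ucb1(lambda a: 1.0 if random.random() < probs[a] else 0.0, 3, 10_000)
```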

Read more

How structured do PCP queries need to be?

Although lower bounds in complexity theory usually refer to impossibility results related to how time or space efficient an algorithm for a class of problems can be, I do think that this question, about the entropy of queries of a PCP, is a sort of lower bound on how oblivious or lazy a verifier can be. Entropy and information gain are important and old concepts in machine learning, but I think that in the age of deep learning, people don’t realize just how wide the concept of a learning algorithm can be stretched. Fields like cryptography often focus on transmitting some very specific, structured knowledge and hide everything else, which is a nice dual perspective to machine learning. Verifiers and extractors are learners, too. Unsurprisingly, techniques from cryptography can show up in machine learning. Alex Irpan has a cool blog post showing how the hybrid argument (which shows up in my proof of the not-uniform-but-still-independent case) appears in imitation learning (RL) proofs.

Read more

Understanding and Implementing Policy Gradients

Policy gradients are a pretty cool class of reinforcement learning algorithms. They focus on directly optimizing the metric we care most about in reinforcement learning (the expected reward from acting in an environment), and because of this they enjoy an elegant formulation that looks very similar to supervised machine learning, and they have stability benefits over approaches like Q-learning, which learn a policy indirectly and can suffer from too much approximation.
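For a taste of that formulation, a sketch of the vanilla REINFORCE loss whose gradient matches \(\nabla J = \mathbb{E}[\sum_t \nabla \log \pi(a_t \mid s_t) G_t]\); the return normalization is an optional variance-reduction choice, not part of the basic estimator:

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """log_probs: list of log pi(a_t|s_t) tensors from one episode; rewards: list of floats."""
    returns, g = [], 0.0
    for r in reversed(rewards):              # discounted return-to-go G_t
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)   # variance-reducing baseline
    return -(torch.stack(log_probs) * returns).sum()
```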

While this blog post is in no sense comprehensive, I hope to show a good mixture of theory and practice. There’s a lot of

Read more