Towards Efficient Intensely Robust Deep-ish Learning

“It is in general that the unexplored attracts us” - Lady Murasaki

Let’s explain the main equation from the Madry paper and then unfold that into how adversarial training is actually done in practice.

We will lift a concept from convex optimization to help geometrically interpret what is going on with adversarial deep learning.

First, here is the main equation from Madry et al.; let's call it the saddle point equation.

Saddle Point Equation

\[\min_{\theta} \mathbb{E}_{(x,y) \sim \mathcal{D}} \left[\max_{\delta \in \mathcal{S}} L(\theta, x + \delta, y)\right]\]

Notice that this is a min-max problem. Most effective attacks are white box, meaning the "attacker" has access to the model, so it can differentiate \(L\) with respect to the input (which requires knowing the network's parameters and backpropagating through them).

But before getting into the details of inner maximizers, let’s dive into the convex optimization.

Strong Duality

Recall that strong duality holds when the primal problem (minimizing a convex \(f(x)\) subject to convex constraints) additionally satisfies a constraint qualification such as Slater's condition. There are also connections between the duality gap and polynomial time computability (see the survey in the references below). But for now, let's just see how strong duality connects to the saddle point equation.

When strong duality holds, the primal and dual problems have the same optimal value, i.e. if \(L(x,\lambda)\) is the Lagrangian then \(\inf_{x} \sup_{\lambda} L(x, \lambda) = \sup_{\lambda} \inf_{x} L(x, \lambda)\), where \(g(\lambda) = \inf_{x} L(x, \lambda)\) is the definition of the dual function. Recall that the LHS is the form of the saddle point equation. Here's an illustration of strong and weak duality. (Thank you, google docs drawing function!)

[Figure: illustration of strong vs. weak duality]

According to the Madry paper, Danskin's theorem says that you can compute gradients of the outer objective by evaluating the gradient of the loss at maximizers of the inner problem. That means that, under an assumption playing a role similar to the one strong duality plays in convex optimization, you can just compute adversarial data points statically, aka adversarially robust deep learning is (almost) just data augmentation. What's annoying is that the smoothness assumptions of Danskin's theorem don't strictly hold, since ReLU networks are not continuously differentiable everywhere, but basically, you can treat them as if they were for the purposes of adversarial deep learning (ADL).
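To make that concrete, here's a minimal sketch of one adversarial training step in PyTorch, assuming some `attack` callable that approximately solves the inner maximization (the PGD sketch further below would do):

```python
import torch

def adversarial_training_step(model, loss_fn, optimizer, x, y, attack):
    """One outer step of the saddle point problem: maximize over delta, then descend in theta."""
    # Inner maximization: find a perturbed input that (approximately) maximizes the loss.
    x_adv = attack(model, loss_fn, x, y)
    # Outer minimization: per Danskin (modulo the smoothness caveats above), the gradient of
    # the inner max w.r.t. theta is just the gradient of L evaluated at the inner maximizer.
    optimizer.zero_grad()
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```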

What's remarkable is that, according to Intriguing Properties of Neural Networks, adversarial examples transfer across architectures and training hyperparameters for a given distributional learning task.

I wanted to make it fancy like min max min max min max, but that's probably what happens if you do the pipeline iteratively, and it's also what GANs (!!) do.

Now that we’ve imagined the pipeline, let’s get into some adversaries. Yum.

The projected gradient descent (PGD) attack from Madry looks weird at first; the equation is

\[x^{t+1} = \Pi_{x + \mathcal{S}} \left(x^{t} + \alpha\, \mathrm{sign}(\nabla_{x} L(\theta, x, y))\right)\]

I was at first confused about why the iterative attack looked multiplicative, since most gradient ascenders or descenders I've encountered, e.g. SGD (stochastic gradient descent), are additive. The resolution is that the \(\Pi\) is not a product at all: \(\Pi_{x + \mathcal{S}}\) is the projection operator onto the allowed perturbation set \(x + \mathcal{S}\), so each iteration is an additive signed-gradient ascent step followed by clipping the iterate back into the \(\epsilon\)-ball around \(x\).
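Here's a minimal \(\ell_\infty\) PGD sketch in PyTorch; `eps`, `alpha`, and `steps` are illustrative hyperparameters, not values from the paper:

```python
import torch

def pgd_attack(model, loss_fn, x, y, eps=8/255, alpha=2/255, steps=10):
    """l_inf PGD: take signed-gradient ascent steps on the loss, then project back into the eps-ball."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()        # additive ascent step
            x_adv = x + (x_adv - x).clamp(-eps, eps)   # the projection Pi onto x + S
            x_adv = x_adv.clamp(0.0, 1.0)              # keep pixels in a valid range
    return x_adv.detach()
```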

Another example is the Papernot attack.

In general, adversarial examples can be found within any \(\mathcal{S}\), i.e. any \(\ell_p\) norm can be used to define the perturbation set and the stopping condition for iterative attacks.

For example, Papernot's attack as written uses the \(\ell_0\) norm, aka counting the number of pixels that are changed. The algorithm modifies a small subset of pixels at each iteration. How does it pick them? By differentiating the network's outputs (or loss) with respect to the input vector and building a saliency map. So basically, treating Papernot's as a black box algorithm for generating adversarial examples: start with some image, set any wrong classification label as the target, and run the algorithm to produce image perturbations close to the original image.
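To be clear about what "differentiate with respect to the input and change a few pixels" means, here's a loose \(\ell_0\)-flavored sketch; it is not the actual saliency map construction from the Papernot paper, just the simplest gradient-guided version of the idea (`k` and `delta` are made-up knobs):

```python
import torch

def greedy_l0_step(model, x, target, k=10, delta=0.2):
    """Nudge the k input coordinates whose gradient most increases the target-class score.

    This is a simplification of the saliency-map attack: the real JSMA scores pixel pairs
    using both target and non-target class gradients. Assumes x has shape (1, ...) and the
    model returns (batch, num_classes) logits.
    """
    x = x.clone().detach().requires_grad_(True)
    score = model(x)[0, target]               # logit of the (wrong) target class
    score.backward()
    flat_grad = x.grad.flatten()
    idx = flat_grad.abs().topk(k).indices     # most influential coordinates
    x_new = x.detach().clone().flatten()
    x_new[idx] += delta * flat_grad[idx].sign()
    return x_new.clamp(0, 1).view_as(x)
```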

The route to ADL started from model distillation, which I’m curious about since in general I’m interested in efficient ways to train models.

For a basic intuition of model distillation, we can think about a cross entropy loss function. For classification tasks, \(-\sum_{i} p_i \log q_i\) (with \(p\) the one-hot label distribution and \(q\) the predicted distribution) pushes the correct class probability higher. In distillation, the KL divergence term \(\sum_{i} p_i \log \frac{p_i}{q_i}\) is computed with \(p\) being the teacher distribution and \(q\) being the student distribution, so in particular the student learns, for example, that St. Bernard is more similar to Dalmatian than either is to bird.
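Here's a minimal sketch of that distillation term in PyTorch. The temperature `T` and the Hinton-style \(T^2\) scaling are conventional choices I'm assuming, not something pinned down above:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = F.softmax(teacher_logits / T, dim=-1)          # teacher distribution
    log_q = F.log_softmax(student_logits / T, dim=-1)  # student log-distribution
    # sum_i p_i * (log p_i - log q_i), averaged over the batch
    return F.kl_div(log_q, p, reduction="batchmean") * (T * T)
```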

Sometimes adversarial robustness doesn't transfer to the student through plain distillation, so ARD is a version of distillation where, given an adversarial dataset for the learning task, distillation is done by matching the teacher's logits within the \(\epsilon\)-radius ball around training samples. So KL matching is also done on the perturbed datapoints in \(\mathcal{S}\), not just the clean ones.

More

  1. Adversarially Robust Distillation
  2. Towards Deep Learning Models Resistant to Adversarial Attacks
  3. Intriguing Properties of Neural Networks
  4. Madry and Kolter Adversarial Robustness Tutorial
  5. Boyd and Vandenberghe Convex Optimization
  6. Duality Gap, Computational Complexity and NP Completeness: A Survey

Lifting the Conceptual Bludgeon Off On and Off Policy Reinforcement Learning

I started adding deadlifting back to my gym routine too!

A Stab at the Definition of On and Off Policy Reinforcement Learning

I wrote this from my heart:

In the RL paradigm, there is an agent that is trained on the data it produces. Some people call this learning from trial and error. Sometimes, the data generating distribution (which is produced by the behavior policy, the actor so to speak) is distributionally different from the policy being trained (sometimes called the target policy).

Why is this bad?

There are many reasons why one would train with the on-policy RL paradigm. One of them is stable optimization, meaning that the policy converges to the intended or optimal distribution. Importance weighting tries to recover this in the off-policy case by re-weighting the empirical data so that the estimate of the expected return, part of the optimization target, is computed more accurately, i.e. is an unbiased estimator.

The goodness of on-policyness can even be linked to capacity efficiency, i.e. why approximating with low rank adapters and doing reinforcement learning on the same dataset can achieve the same test NLL, or some other measure of generalization error, as full capacity SFT.

But isn’t this kind of ambiguous?

Let's say that you have an inference and training setup in which the inference server does \(B\) episodes/trajectories, then the trainer collects the data and does a weight update, then broadcasts the new weights to the inference servers. During the weight update, if the optimization batch size is \(b \ll B\), then after the first optimizer step the remaining minibatches are computed against rollouts from a now-stale policy, so essentially off-policy RL is happening. The batch of \(B\) episodes/trajectories is basically a replay buffer.

In other words, in the wild, foundation models are usually distributed across GPUs (which can be in a given GPU cluster or node, or across nodes), so in particular inference is on a different computer than training. There are a bunch of things to optimize in such a setup, but for the wild (the wild on-policy RL!), we'd like to maximize GPU utilization while staying as on-policy as possible (who knows what unit that is measured in? I would really like to know). In the paradigm of in-flight updates, after each optimizer step (optim.step() in your torch or whatever code), updated weights are broadcast to the inference servers. This is really cool. It would be even cooler to customize the delay, in steps, between broadcasts. In some sort of limit, repeated off-policy rl is basically on-policy rl!

Btw, here’s how to do importance weighting

To fix the off-policyness, i.e. when estimating \(\mathbb{E}_{\pi}[R]\) (aka the expected return), where the expectation is integrated or summed against \(\pi\) but the data is generated by a different behavior policy \(\mu\), multiply the observed reward / return by the ratio \(\frac{\pi(a|s)}{\mu(a|s)}\). Intuitively, it reweights away the implicit \(\mu\) which underlies the data actually generated by the actor. In other words, it corrects for the probability term in the expectation.
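As a sketch, here's the simple per-decision importance-weighted return estimator for a single trajectory, with `pi` and `mu` assumed to be callables giving each policy's action probabilities:

```python
import numpy as np

def is_weighted_return_estimate(states, actions, rewards, pi, mu):
    """Importance-weighted estimate of E_pi[R] from one trajectory generated by mu.

    pi(a, s) and mu(a, s) return the probability each policy assigns to action a in state s.
    This is the plain (high-variance, undiscounted) product-of-ratios estimator.
    """
    ratio = 1.0
    weighted_return = 0.0
    for s, a, r in zip(states, actions, rewards):
        ratio *= pi(a, s) / mu(a, s)   # cumulative likelihood ratio pi / mu up to this step
        weighted_return += ratio * r   # reweight the observed reward
    return weighted_return
```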

Some Examples of On and Off Policy in a Real World Scenario, Princess Version

I love Waymo. I think I actually feel safer in Waymos. The other day I scheduled a Rivian test drive because I was SO BORED. Rivian is like an electric version of Range Rover. I love Range Rovers too.

I was thinking through some real world examples of distributional challenges and reinforcement learning!

Let’s think about training a Waymo, but holding all other agents constant (I know, BAD!) as some factorization of the environment and reward, yay partially observable MDPs!

What I really want to get at here is a bunch of different concepts in a real world application, where we want to deploy a safe and accurate policy. Here's how I imagine training a Waymo would go. Let's say you have historical data \(D_{H}\) from a human driving a car in the format \(s, a, r, s, ...\). You can train an initial policy via behavior cloning / imitation learning / SFT on a base model.

Depending on the human driver, this may be a conservative or a risky policy. Waymos deployed in the real world are actually very conservative in the sense that they don’t run red lights, they stop for pedestrians, et cetera. However, we would like to train a policy which learns to act in risky scenarios, to more fully cover the \((state, action)\) distribution, and learns via reward / trial and error.

After training an initial policy via behavior cloning, one may tweak the conservativeness of the policy during a simulated test drive (via things like temperature, if doing a softmax policy, or an entropy bonus in the on-policy optimization objective; I would love to know more). This is when there are still humans in the Waymo car (rlhf!), during the on-policy simulated driving training session, for the \((s, a)\) coverage. This is in some sense a risky policy which gets direct human feedback.

To shift a risky policy back to being more conservative, you could probably train it off-policy via more conservative \(s,a,r,s...\).

More soon.

from MCMC to Variational Inference

In which this became an exercise in deriving the closed form gaussian KL expression from Auto-Encoding Variational Bayes!

Almost all observed data is the result of some process with hidden latent factors. Bayesian analysis provides a recipe for learning about the unknown latent variables \(z\) from that data \(x\).

1. Specify a prior \(p(z)\) quantifying what is known about \(z\) before any data is observed
2. Learn a likelihood function \(p(x \mid z)\), or decoder
3. Apply Bayes’ rule \(p(z|x) = \frac{p(x|z)p(z)}{\int_z p(x|z)p(z) dz}\) to learn the posterior distribution, which describes what is known about \(z\) after observing the data \(x\)

The issue is that computing \(p(z \mid x)\) is typically infeasible: the marginalization in the denominator becomes computationally intractable when the latent variables are high dimensional. In other words, uncertainty is expensive, and approximate inference methods all get at characterizing the posterior (or \(p(x)\)) without integrating over all configurations of the latents.

Markov Chain Monte Carlo (MCMC) methods and variational Bayesian methods differ in whether they explicitly model an approximation to \(p(z \mid x)\) with a recognition model (encoder).

The MCMC paradigm is to sample \(z_0 \sim p(z)\) and apply a transition operator \(q(z_t \mid z_{t-1}, x)\) until \(z_T\) is a random variable whose distribution converges to the posterior \(p(z \mid x)\). The VAE methodology instead parameterizes and learns an approximation \(q_{\phi}(z \mid x)\) to the posterior.

One particular version of MCMC is the Metropolis-Hastings algorithm, which can be seen as a random walk whose samples get closer to being distributed as \(p(z \mid x)\). The initial sample is drawn from a standard Gaussian, and the particle moves to a proposed state (generated by adding noise to the current sample) with an acceptance probability given by the ratio of the two states' (unnormalized) posterior densities, so higher-probability proposals are always accepted and lower-probability ones only sometimes. In the implementation for this blog post, the sample is decoded and the reconstructed data is compared to the actual data \(x\). After a given number of steps, the binary cross entropy loss between the current reconstruction and the data point is backpropagated to train the decoder.
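For reference, here's a bare-bones random-walk Metropolis sketch in NumPy against an unnormalized log-density; the decoder/BCE wiring described above is left out:

```python
import numpy as np

def metropolis_hastings(log_prob, z0, n_steps=1000, step_size=0.1):
    """Random-walk Metropolis: propose z' = z + noise, accept with prob min(1, p(z')/p(z))."""
    z = np.array(z0, dtype=float)
    samples = []
    for _ in range(n_steps):
        proposal = z + step_size * np.random.randn(*z.shape)
        # symmetric proposal, so the acceptance ratio is just the density ratio
        log_accept = log_prob(proposal) - log_prob(z)
        if np.log(np.random.rand()) < log_accept:
            z = proposal
        samples.append(z.copy())
    return np.array(samples)
```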

The major inefficiency in Metropolis-Hastings is the random walk, which uses the same proposal distribution throughout the algorithm. One wonders, after seeing some training data, isn't there a more efficient way of approximating the latent posterior?

Among other methods which use information from training to guide sampling, variational inference does so by explicitly learning a parameterization of the encoder \(q(z \mid x)\), also called a recognition model. Assuming the encoder is a Gaussian, we can model it with a neural network which outputs \(\mu, \sigma^2\) given \(x\) as input.
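A minimal sketch of such an encoder in PyTorch, with placeholder sizes, outputting \(\mu\) and \(\log \sigma^2\) plus a reparameterized sample:

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """q(z|x) = N(mu(x), diag(sigma^2(x))), parameterized by a small MLP."""
    def __init__(self, x_dim=784, h_dim=256, z_dim=20):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)   # predict log sigma^2 for numerical stability

    def forward(self, x):
        h = self.hidden(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return z, mu, logvar
```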

Starting with the expression for the KL divergence between the learned model \(q_{\phi}(z \mid x)\) and the true posterior \(p(z \mid x)\), we can derive the evidence lower bound (ELBO), which is a lower bound on \(\log p(x)\) and an objective which, when maximized, yields an estimate of \(\log p(x)\), converting the inference problem into an optimization problem.

The derivation is

\[\begin{align*} D_{KL}(q_{\phi}(z|x) \| p(z | x)) &= \int_z q_{\phi}(z|x) \log \frac{q_{\phi}(z|x)}{p(z|x)} dz \\ &= - \int_z q_{\phi}(z|x) \log \frac{p(z|x)}{q_{\phi}(z|x)} dz \\ &= - \int_z q_{\phi}(z|x) \log \frac{p(z,x)}{q_{\phi}(z|x)p(x)} dz \\ &= - \left(\int_z q_{\phi}(z|x) \log \frac{p(z,x)}{q_{\phi}(z|x)} dz - \int_z q_{\phi}(z|x) \log p(x) dz \right) \\ &= - \int_z q_{\phi}(z|x) \log \frac{p(z,x)}{q_{\phi}(z|x)} dz + \log p(x) \end{align*}\]

So

\[\log p(x) = \mathcal{L} + D_{KL}(q_{\phi}(z|x) \| p(z | x))\]

Since \(D_{KL}\) is non-negative, \(\mathcal{L}\) is a lower bound on \(\log p(x)\).

We can then write \(\mathcal{L}\), the ELBO, as

\[\begin{align*} \mathcal{L} &= \int_z q_{\phi}(z|x) \log \frac{p_{\theta}(z,x)}{q_{\phi}(z|x)} dz \\ &= \int_z q_{\phi}(z|x) \log \frac{p_{\theta}(x|z) p(z)}{q_{\phi}(z|x)} dz\\ &= \mathbb{E}_{z \sim q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - D_{KL} (q_{\phi}(z|x) \| p(z)) \end{align*}\]

Analytic Integral of the KL Divergence of two Gaussians

The ELBO contains a \(- D_{KL} (q_{\phi}(z|x) \| p(z))\) term. We can integrate this expression analytically with a combination of algebra, properties of integrating probability distributions and the trace trick for expectations of quadratic forms.

First, let’s be explicit about the expressions for two PDFs:

\[q(z) = \mathcal{N}(z; \mu, \sigma^2) = \frac{1}{\sqrt{(2\pi)^J |\Sigma|}} \exp\left(-\frac{1}{2} (z - \mu)^T \Sigma^{-1} (z - \mu)\right)\] \[p(z) = \mathcal{N}(z; 0, I) = \frac{1}{\sqrt{(2\pi)^J |I|}} \exp\left(-\frac{1}{2} z^T z\right)\]

Multivariate Gaussian Facts

1. \(\text{Cov}(z) = \mathbb{E}_{z}[(z - \mu)(z - \mu)^T] = \Sigma\) where \(\Sigma\) is the covariance matrix
2. \(\mathbb{E}[z] = \mu\)

Trace Trick for Expectations of Quadratic Forms

Let \((z-\mu)^T A (z - \mu)\) be the quadratic form.

1. A quadratic form is a scalar, so it is its own trace \(\mathbb{E}[(z - \mu)^T A (z - \mu)] = \mathbb{E}[\text{tr}((z - \mu)^T A (z - \mu))]\)

2. Cyclic property of trace means \(\text{tr}(ABC) = \text{tr}(BCA) = \text{tr}(CAB)\)

3. Linearity of expectation means the expectation and the trace commute: \(\mathbb{E}[\text{tr}(X)] = \text{tr}(\mathbb{E}[X])\)

The Derivation of Closed Form KL Divergence of Two Gaussians

The overall structure was:
1. Notice that \(-D_{KL} (q(z|x) \| p(z)) = \int q(z)(\log p(z) - \log q(z)) dz\), which decomposes into \(\int q(z) \log p(z) dz\) and \(\int q(z) \log q(z) dz\)
2. Compute \(\int q(z) \log p(z) dz\) and \(\int q(z) \log q(z) dz\) separately and combine them (the second enters with a minus sign)

To calculate \(\int q(z) \log q(z) dz\):
1. Simplify \(\log q(z)\)
2. Distribute \(\int q(z)\)

First, write the \(\log\) of \(q(z)\):

\[\log q(z) = \log \left(\frac{1}{\sqrt{(2\pi)^J \prod_{j=1}^J \sigma_j^2}}\right) - \frac{1}{2}(z - \mu)^T \Sigma^{-1}(z - \mu)\]

Now, distribute \(\int q(z)\):

\[\begin{align*} \int_{z} q(z) \log q(z) dz &= \log \left(\frac{1}{\sqrt{(2\pi)^J \prod_{j=1}^J \sigma_j^2}}\right) \int_{z} q(z) dz - \frac{1}{2} \mathbb{E}_{z} [(z - \mu)^T \Sigma^{-1} (z - \mu)] \\ &= \log 1 - \log \left((2\pi)^{J/2}(\prod_{j=1}^J \sigma_{j}^2)^{1/2}\right) - \frac{1}{2} \mathbb{E}_{z} [\text{tr}(\Sigma^{-1} (z - \mu) (z - \mu)^T)] \\ &= - \frac{J}{2} \log(2\pi) - \frac{1}{2} \sum_{j=1}^{J} \log \sigma_{j}^2 - \frac{1}{2} \text{tr}(\mathbb{E}_z[\Sigma^{-1} \Sigma]) \\ &= - \frac{J}{2} \log(2\pi) - \frac{1}{2} \sum_{j=1}^{J} \log \sigma_{j}^2 - \frac{J}{2} \\ &= -\frac{J}{2} \log(2\pi) - \frac{1}{2} \sum_{j=1}^J (1 + \log \sigma_{j}^2) \end{align*}\]

The computation of \(\int q(z) \log p(z) dz\) follows a similar approach!
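Putting both integrals together gives the familiar closed form from Appendix B of Auto-Encoding Variational Bayes, \(-D_{KL}(q \| p) = \frac{1}{2}\sum_{j=1}^J (1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2)\). In code (a sketch, assuming the encoder outputs \(\mu\) and \(\log \sigma^2\)):

```python
import torch

def neg_kl_term(mu, logvar):
    """-D_KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dims per sample.

    Implements the closed form -KL = 0.5 * sum_j (1 + log sigma_j^2 - mu_j^2 - sigma_j^2).
    """
    return 0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
```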

stochastic parrots of post raining 🌧️ ☔️ post training reinforcement learning reasoning

What sparked my curiosity? The long reflection tokens from the m1 paper, which sent me reeling in a chat with my friend. Then I read another paper, the Qwen3 report, where they talked about thinking and non-thinking mode fusion, and I was like, is that really that difficult? It seems you can mix and match any sequence of domain-specific foundation model rollouts by appending transition tokens, which functionally are similar to conditional logic (and, or, where, etc …), fine-tune on those rollouts, rinse and repeat.

The long reflection tokens from m1 ('However', 'Recheck', 'Wait', 'Aha') were apparently very important for the reasoning paths and for stabilizing the entropy of the learned policy, which in turn is apparently important for downstream reasoning RL performance. These tokens have a low \(\pi_{ref}\) probability in the denominator of the importance sampling (IS) ratio, which creates a high \(\dfrac{\pi_{cur}}{\pi_{ref}}\) multiplying the advantage (\(\sum_{i=1}^t r_i - V\)), so they get clipped out of PPO (and GSPO and GRPO) updates. To preserve them, the CISPO objective from the m1 paper simply adds a stop gradient operation on the IS ratio, so the high \(\dfrac{\pi_{cur}}{\pi_{ref}}\) term does not explode the gradient update in the chain rule of backpropagation, and the tokens ('However', 'Recheck', 'Wait', 'Aha') are still assigned credit in the update to increase their log_proba.

PPO Objective Function Minus the KL Term

\(J(\theta) = \mathbb{E}\left[\frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min(r_{i,t}A_{i,t}, \text{clip}(r_{i,t}, 1-\epsilon, 1+\epsilon)A_{i,t})\right]\)

GRPO Objective function

\[J(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min(r_{i,t}A_{i,t}, \text{clip}(r_{i,t}, 1-\epsilon, 1+\epsilon)A_{i,t})\right]\]

CISPO Objective function

\(J(\theta) = \mathbb{E}\left[\frac{1}{\sum_{i=1}^{G} |o_i|} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \text{sg}(r_{i,t})A_{i,t} \log \pi_{\theta}(o_{i,t} \mid q, o_{i,<t}) M_{i,t} \right]\)

where \(M_{i,t} = 0\) if \(A_{i,t} > 0\) and \(r_{i,t} > 1 + \epsilon_{high}\), \(M_{i,t} = 0\) if \(A_{i,t} < 0\) and \(r_{i,t} < 1 - \epsilon_{low}\), and \(M_{i,t} = 1\) otherwise.

The main insight going from PPO to GRPO, the "group" part, is that one may approximate the baseline term \(V\) in the advantage computation \(A = R - V\) with the average return of the group of rollouts, the many completions for a fixed prompt (the (Prover, Verifier) pairs).
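A minimal sketch of that group-relative baseline; I'm including the standard-deviation normalization that GRPO also applies, which the text above doesn't mention:

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each rollout's reward against its group.

    `rewards` has shape (G,), one scalar reward per completion of the same prompt.
    """
    baseline = rewards.mean()                    # the group mean plays the role of V
    return (rewards - baseline) / (rewards.std() + eps)
```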

The main insight going from GRPO to CISPO is that transforming the clipping of \(r_{i,t}A_{i,t}\) into a mask lets one inspect \(r_{i,t}\) and \(A_{i,t}\) and mask a token out only when the ratio has over- or under-flowed the trust region. I assume the 'Aha', 'Wait!' tokens have consistently high \(A_{i,t}\) but decreasing \(r_{i,t}\) as the policy learning weights them higher over time.

I shouted to S.Z., why was the entire section about the long reflection tokens in the m1 paper necessary??? I wonder, if one studied the long reflection tokens from the m1 paper in Euclidean space post post-training, whether they would belong to the same subspace. Why am I always becoming an interpretability girlie, against my better judgment?! She said, I wonder about the false positives for the reflection tokens? I said, I can imagine some trigger words. (…) Who would have thought that one could simply inspect conditional probabilities?

The KL term, these days, is left out of most reasoning model objective functions, as policies deviate wildly from the reference policy anyway. I shouted to A.L., the rollout completion length penalty in section 2.2.3 of the magistral paper makes no sense! Why not just append a STOP_THINKING token after correct reasoning traces and move this upstream to the Long CoT cold start behavior imitation stage of the post raining post training reasoning rl pipeline? I finally figured out how to do Las Vegas algorithms with foundation models, something I had been thinking about forever. Computability theory, my OG.

Anyway, the point of the sections above was to point out the similarity between the fork-in-the-road reflection tokens in the reasoning rollouts and the thinking mode fusion from Qwen3, at different layers of abstraction.

Entropy in the RL Reasoning Phase

Papers such as this one talk about the importance of the entropy of the optimized policy for downstream performance, fitting the equation \(R = -a \cdot e^{H} + b\). The downstream evaluations are datasets such as OMNI-BENCH, AIME 2024, et cetera. (Recall that foundation models expose a log_proba or proba function which allows you to compute \(H = -\sum p \log p\).)
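As a sketch, computing that per-token entropy from the model's logits looks like:

```python
import torch
import torch.nn.functional as F

def token_entropy(logits):
    """H = -sum_v p(v) log p(v), computed per position from the model's logits."""
    log_p = F.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1)   # shape: (..., seq_len)
```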

There are several ways to optimize for the entropy of a reasoning RL policy:

  1. Entropy Bonus in the objective function

I imagine this is adding \(\beta H(\pi)\) to the objective being maximized (equivalently, \(-\beta H\) to the loss).

  2. \(\epsilon_{high}\) in the clip function of a GRPO objective

This is what magistral does; they say that the entropy bonus of method 1 causes instability. The basic modification to the GRPO objective is to adjust the clipping thresholds.

\[J(\theta) = \mathbb{E}\left[\frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min(r_{i,t}A_{i,t}, \text{clip}(r_{i,t}, 1-\epsilon_{low}, 1+\epsilon_{high})A_{i,t})\right]\]

A larger \(\epsilon_{high}\) allows \(\pi_{cur}\) to deviate further from \(\pi_{ref}\), which is something adjusting the \(\beta\) term on a KL penalty could also do.

  3. Converting the clip into a stop_gradient and masking in the CISPO objective

The stop_gradient is implemented by returning None in the backward pass of an autodifferentiation graph, treating the variable like a constant. The difference with clipping is that the high IS term \(r_{i,t}\) stays part of the loss computation and weights the fork-in-the-road tokens, which contribute to high entropy (exploration), accordingly. I'm not sure what the m1 paper means by 'clipped_out' as a token; my guess is that the token's log_proba needs to be high enough relative to the non-exploration tokens. I'm curious about this.
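Here's a sketch of the stop-gradient-plus-mask idea in PyTorch, where `detach()` plays the role of sg and the \(\epsilon\) thresholds are placeholders rather than the m1 paper's values:

```python
import torch

def cispo_token_loss(logp_cur, logp_old, advantages, eps_high=2.0, eps_low=0.8):
    """CISPO-style per-token loss: keep the IS ratio as a detached weight instead of clipping it.

    logp_cur: log pi_theta(o_t | ...) from the current policy (requires grad)
    logp_old: log-probs from the behavior policy that generated the rollout (no grad)
    advantages: per-token advantages A_{i,t}
    """
    ratio = (logp_cur - logp_old.detach()).exp()
    # mask out tokens whose ratio has blown past the trust region, as in the M_{i,t} above
    mask = torch.ones_like(ratio)
    mask[(advantages > 0) & (ratio > 1 + eps_high)] = 0.0
    mask[(advantages < 0) & (ratio < 1 - eps_low)] = 0.0
    # sg(ratio): the ratio weights the update but contributes no gradient of its own
    per_token = ratio.detach() * advantages * logp_cur * mask
    # negate because optimizers minimize; normalize by the total token count
    return -per_token.sum() / ratio.numel()
```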

I wonder, how much entropy is too much entropy? I wish the entropy mechanism paper had just reported an exact value of \(H\), if it exists, for the downstream tasks. I suppose the predictive equation is more flexible as tasks become ever more computationally difficult.

KL Distillation

They took out the KL term in the RL phase of the pipeline (recall that m1 and magistral and probably a few other papers do not have a KL term in the RL objective)! But KL distillation, ah the good teacher and the professor forcing (so to speak, those are the titles of the papers!), can be used to train lightweight smaller models from rollouts of the heavyweight models (trained through the (pretraining -> Long CoT cold start -> reasoning RL -> thinking mode fusion -> general RL) pipeline). It seems inefficient that one must have log_proba access to the heavyweight model's logits to compute this update, i.e. to do logit distillation.

Thank you to SZ for amplifying my inspiration with random words like rollouts and transformers, and for the blog in which I learned the term logit distillation, and to RV for the open weight papers.

Cultural Contrastive Learning

After a long hiatus from this blog, I’m writing about a three-week experiment I did to express a proof of concept methodology for finetuning multimodal embeddings to imitate a human’s engram when one has access to social media platforms, building on ideas from contrastive learning, multimodal foundation models, philosophy, and media theory. The full text can be found here

Modality gap in representation learning is a well-studied problem. While many have definitions of it in measurable benchmarks, most conceptualizations and solutions of the problem focus on style and lower-order concrete semantics. We elucidate a new perspective and class of problems in the space of multimodal representation learning, especially as it pertains to personalization, provide a proof of concept of finetuning a representation space for this problem, and discuss its applications in various generative AI pipelines.

Qualia refers to the subjective, qualitative, and felt experiences of an individual's conscious experience. Examples include the feeling of pain, the taste of coffee, or the color red as you, the individual, perceive it. Qualia is conditional on an individual's neural architecture, so to speak, and the experiences they collect through their life, the particular environments they are embedded in (in the sense of other agents being part of an environment, in the sense of the Sapir-Whorf hypothesis, in the sense of Wittgenstein, and so on). (The Sapir-Whorf hypothesis suggests that language influences thought. This connects to Wittgenstein's concept of language games, where meaning emerges from use within specific contexts.)

In culture (so, the qualia of a collective group of individuals), such as literature and art, qualia can be described formally as synaesthesia (sensory crossover between modalities), aesthetic affinity (a form of emotional kinship), or semiotics (shared symbolic languages). Other, informal words for this might be resonance, evocation, or zeitgeist convergence (shared cultural moment expression).

The concept of a modality gap was first introduced in Mind the Gap (Liang et al., 2022), which posits that the geometric inductive bias introduced in multimodal embeddings, in which unimodal domains are tokenized and embedded separately, creates a modality gap on image and image-caption distributions.

A searchable continuous latent space which solves the modality gap lends itself to a multimodal embedding as well as a latent space for user-conditional multimodal generation. We believe that this is an approximation to understanding the phenomenal binding problem, which is about how objects, background objects, as well as abstract and affective features are integrated into a unified experience for an individual.

A motivating application of customizing a CLIP-like latent space is its use in custom text-conditional diffusion pipelines. A custom latent space approach could be complementary to fine-tuning diffusion weights, which focuses more on direct style-transfer-like results rather than semantic understanding.

Thus far, the representational learning research community has focused on multimodal distributions of (text, image) pairs which are relatively straightforward in their translation. For example, the Mind the Gap paper evaluates models for their geometric gap on the COCO dataset, which contains photos of generic objects (in the same sense that DALL-E 1 was pre-trained on (image, text) pairs for which the text appeared in Wikipedia \(> 100\) times), and Voyage evaluates mixed modality search on the distribution (text, image of text), i.e. that the string "Hello world" retrieves an image of "Hello world" rather than a string such as "Cat."

While these evaluations form a baseline of the multimodal representation gap, they still represent relatively simple cross-domain transformations. For example, the transformation from image to image caption focuses on the object level, which most if not almost all observers of the image would agree on. And the transformation from text to image of text is as simple as saving a PDF with the text in it. In some sense, it means that these joint distributions have higher mutual information and are easier to learn than an individual's sensory space.

An individual's sensory space, on the other hand, is shaped by their histories, experiences, and unique biology. Digitally, it is traceable, for example, through a user's hypertextual space, intentional navigation through the web, manual linking, et cetera. When we learn a custom user adapter downstream of a pre-trained baseline multimodal embedding model, we are, in some sense, learning this transformation.
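As a proof-of-concept sketch (all names and sizes here are hypothetical, not the actual experiment's code), such a user adapter could be a small residual projection on top of frozen base embeddings, trained with a symmetric InfoNCE loss over the user's paired (text, image) artefacts:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UserAdapter(nn.Module):
    """Small residual projection applied on top of frozen base embeddings."""
    def __init__(self, dim=1024, hidden=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, e):
        # residual update keeps the adapted space close to the pre-trained one
        return F.normalize(e + self.proj(e), dim=-1)

def contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (text, image) user artefacts."""
    logits = text_emb @ image_emb.T / temperature
    targets = torch.arange(len(logits), device=logits.device)  # matching pairs are on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```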

The dataset we explore with is a second-degree scrape of a test user’s text-based and image-based semantic space. This is chosen as a joint distribution which is representative of the test user’s qualia (perhaps, in a cultural subspace).

We choose to join Substack and Pinterest space as they represent intentional content discovery platforms in which users explore an inner space such as aesthetics for home design, visualizing futures, or introspective writing with emotional qualities. In particular, these are two domains in which we expect a high frequency of cross-artefact association driven by the user's intuitive style, or rhizomatic thinking.

In particular, this user-qualia-representative dataset is one in which various forms of a modality gap appear with embedding spaces such as voyage-multimodal-3.