RoPE Embeddings (Multidimensional)

We’ll first go over RoPE for the classic case of language modeling with an input tensor of shape (batch, seq_len, embedding_dim) and then extend to the 3D case for multimodal transformers.

Fix a position $n$: this is the token's position in the sequence.

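Then each pair $(q_{2i}, q_{2i+1})$ of adjacent entries along the embedding dimension is rotated by a $2 \times 2$ matrix:

$\begin{bmatrix} \cos(n \theta_i) & \sin(n \theta_i) \\ -\sin(n \theta_i) & \cos(n \theta_i) \end{bmatrix} \begin{bmatrix} q_{2i} \\ q_{2i+1} \end{bmatrix}$

where $\theta_i = \frac{1}{10000^{2i/d}}$, $d$ is the embedding dimension, and $i = 0, \dots, d/2 - 1$.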
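To make the definition concrete, here is a minimal pure-Python sketch of the frequencies $\theta_i$ and of the rotation applied to a single pair. The function names `rope_frequencies` and `rotate_pair` and the `base` parameter are illustrative choices, not taken from any library, and the sign convention follows the matrix written above (first row cos, sin).

```python
import math


def rope_frequencies(d, base=10000.0):
    """Return [theta_0, ..., theta_{d/2 - 1}] with theta_i = base ** (-2i / d)."""
    if d % 2 != 0:
        raise ValueError("embedding dimension d must be even")
    return [base ** (-2.0 * i / d) for i in range(d // 2)]


def rotate_pair(x0, x1, angle):
    """Apply the 2x2 matrix [[cos, sin], [-sin, cos]] to the pair (x0, x1)."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * x0 + s * x1, -s * x0 + c * x1)


# Hypothetical example: embedding dimension 8, so there are 4 pairs.
thetas = rope_frequencies(8)

# Rotate the first pair (i = 0) of a token at position n = 3.
pair = rotate_pair(1.0, 0.0, 3 * thetas[0])
```

The first frequency is always 1, and each later pair uses a smaller frequency, so higher-index pairs rotate more slowly as the position grows.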

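Since adjacent indices are paired together, the full rotation matrix is block diagonal, with one $2 \times 2$ block per pair:

$\begin{bmatrix} R(n \theta_0) & & & \\ & R(n \theta_1) & & \\ & & \ddots & \\ & & & R(n \theta_{d/2 - 1}) \end{bmatrix} \begin{bmatrix} q_0 \\ q_1 \\ \vdots \\ q_{d-1} \end{bmatrix}$

Here $R(n \theta_i)$ is the $2 \times 2$ rotation for pair $i$, so there are $d/2$ blocks in total.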

So for each position in the sequence, we’ll create a version of this matrix.
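As a rough illustration of the 1D case, the sketch below rotates every token of a nested list shaped (batch, seq_len, embedding_dim) pair by pair, and also builds the explicit block-diagonal matrix for one position so the two can be compared. It uses plain Python lists rather than a tensor library to stay self-contained; all function names are hypothetical, and the sign convention is first row (cos, sin), second row (-sin, cos).

```python
import math


def rope_rotate(vec, n, base=10000.0):
    """Rotate one embedding vector for position n, one adjacent pair at a time."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("embedding dimension must be even")
    out = [0.0] * d
    for i in range(d // 2):
        angle = n * base ** (-2.0 * i / d)  # n * theta_i
        c, s = math.cos(angle), math.sin(angle)
        x0, x1 = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = c * x0 + s * x1
        out[2 * i + 1] = -s * x0 + c * x1
    return out


def rope_matrix(n, d, base=10000.0):
    """Explicit d x d block-diagonal matrix for position n (for checking only)."""
    m = [[0.0] * d for _ in range(d)]
    for i in range(d // 2):
        angle = n * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        m[2 * i][2 * i], m[2 * i][2 * i + 1] = c, s
        m[2 * i + 1][2 * i], m[2 * i + 1][2 * i + 1] = -s, c
    return m


def matvec(m, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def apply_rope(x, base=10000.0):
    """x is a nested list of shape (batch, seq_len, dim); position = index in seq."""
    return [[rope_rotate(vec, n, base) for n, vec in enumerate(seq)] for seq in x]


# Hypothetical input: batch of 1, three tokens, embedding dimension 4.
x = [[[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]]]
y = apply_rope(x)
```

The pairwise loop touches each entry once, while the explicit matrix costs $d^2$ storage per position, which is why the matrix is usually only a conceptual device.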

OK, now we’re ready to move on to 3D rotary positional embeddings. Consider an input with time, height, and width dimensions. Each token sits at a coordinate $(t, h, w)$, and the embedding dimension is split so that $d_t$ entries encode time, $d_h$ encode height, and $d_w$ encode width. The rotation matrix then looks (conceptually) like a gigantic block-diagonal matrix. In practice, we can loop over each pair of coordinates and rotate them, which is much more computationally efficient than building the full matrix.

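$\begin{bmatrix} R_t & & \\ & R_h & \\ & & R_w \end{bmatrix}$

Each diagonal block has the same structure as in the 1D case, with the token's coordinate along that axis in place of $n$. For example, $R_t$ has diagonal blocks $R(t \theta_0), R(t \theta_1), \dots, R(t \theta_{d_t/2 - 1})$, and $R_h$ and $R_w$ are built the same way from $h$ and $w$.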
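The 3D case can be sketched the same way: split the embedding vector into a time chunk, a height chunk, and a width chunk, and rotate each chunk with its own coordinate. This is only a sketch under stated assumptions: the names `rotate_chunk` and `rope_3d` are hypothetical, and the frequencies within each chunk are assumed to be computed from that chunk's own size, which is one possible choice and may differ between models.

```python
import math


def rotate_chunk(chunk, pos, base=10000.0):
    """1D RoPE on one chunk; ASSUMPTION: theta_i uses the chunk's own size."""
    size = len(chunk)
    if size % 2 != 0:
        raise ValueError("each axis must get an even number of entries")
    out = [0.0] * size
    for i in range(size // 2):
        angle = pos * base ** (-2.0 * i / size)
        c, s = math.cos(angle), math.sin(angle)
        x0, x1 = chunk[2 * i], chunk[2 * i + 1]
        out[2 * i] = c * x0 + s * x1
        out[2 * i + 1] = -s * x0 + c * x1
    return out


def rope_3d(vec, t, h, w, d_t, d_h, d_w, base=10000.0):
    """Rotate the first d_t entries by t, the next d_h by h, and the last d_w by w."""
    if d_t + d_h + d_w != len(vec):
        raise ValueError("d_t + d_h + d_w must equal the embedding dimension")
    out, start = [], 0
    for pos, size in ((t, d_t), (h, d_h), (w, d_w)):
        out.extend(rotate_chunk(vec[start:start + size], pos, base))
        start += size
    return out


# Hypothetical token with embedding dimension 8, split as 2 + 2 + 4.
token = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
rotated = rope_3d(token, t=1, h=2, w=3, d_t=2, d_h=2, d_w=4)
```

Because each axis only touches its own chunk, changing a token's width coordinate leaves the time and height entries of the rotated vector unchanged.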

Ann He
