We’ll first go over RoPE for the classic case of language modeling with an input tensor of shape (batch, seq_len, embedding_dim) and then extend to the case for multimodal transformers.
Fix a position , this is the token position in the sequence.
Then each pair (along the embedding dimension) has a rotation matrix
Where
Since adjacent indices are paired together the rotation matrix is a block diagonal.
So for each position in the sequence, we’ll create a version of this matrix.
Ok, now we’re ready to move onto rotary positional embeddings. Consider an input with a height, width, and a time dimension. Then for each token at a coordinate (t,h,w) there are positions for time, for height, and for width. So the rotation matrix ends up looking (conceptually) like a gigantic block matrix. In practice, we can loop over each pair of coordinates and rotate them—this is much more computationally efficient.