Multidimensional RoPE
Non-Axial Rotary Positional Embeddings in higher dimensions.
I was recently part of some cool discussions in an ML research group chat about extending rotary positional embeddings[1] for transformers to higher dimensions. This post highlights some of the ideas that came out of it. The starting point was that attention scores under 2D RoPE should be isotropic, which is not the case for the widely used Axial RoPE variant[2]. Hence, a non-axial variant for 2D was proposed.
People are calling it Golden RoPE, Uniform RoPE, or IsotRoPE.
Most of the credit goes to jerry, who also wrote a great post about it[3].
Kevin Yin wrote a follow-up appendix to jerry’s post, attributing most of the success to incoherent angles instead of angle uniformity. He also expands on finding the optimal angle spacing by formulating it as an energy minimization problem.[4]
nor has a great post on deriving RoPE the proper way with 3 very reasonable constraints[5]:
- Relative position dependency: \(\langle f(q, p_q), f(k, p_k) \rangle = g(q, k, p_q - p_k)\)
- Norm preservation: \(\|f(q, p_q)\| = \|q\|\)
- Linearity: \(f(q, p_q) = M(p_q)q\)
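As a quick sanity check, the usual 1D RoPE rotation on a single 2D pair can be tested against these three constraints numerically. This is a minimal sketch in plain Python; the helper names (`rotate`, `f`, `dot`), the frequency `omega=0.5`, and the sample vectors are assumptions made for illustration, not something taken from nor's post.

```python
import math

def rotate(v, theta):
    """Apply the 2x2 rotation matrix R(theta) to a 2D vector v."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def f(v, p, omega=0.5):
    """1D RoPE on one 2D pair: f(v, p) = R(omega * p) v (omega is an assumed value)."""
    return rotate(v, omega * p)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

# Arbitrary sample query and key (hypothetical values).
q, k = (0.3, -1.2), (0.7, 0.4)

# Relative position dependency: shifting both positions by the same
# amount should leave the score unchanged.
score_near = dot(f(q, 5.0), f(k, 2.0))
score_far = dot(f(q, 105.0), f(k, 102.0))

# Norm preservation: rotating q should not change its length.
norm_before = math.hypot(*q)
norm_after = math.hypot(*f(q, 5.0))

# Linearity: f(a*q + b*k, p) should equal a*f(q, p) + b*f(k, p).
a, b = 2.0, -3.0
mixed = (a * q[0] + b * k[0], a * q[1] + b * k[1])
lhs = f(mixed, 5.0)
rhs = tuple(a * x + b * y for x, y in zip(f(q, 5.0), f(k, 5.0)))

print(abs(score_near - score_far), abs(norm_before - norm_after), lhs, rhs)
```

All three differences should come out at floating-point noise level, since a rotation is a norm-preserving linear map and $R(\theta_q)^\top R(\theta_k)$ depends only on $\theta_k - \theta_q$.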
You should read jerry’s blog[3] to get the full picture, but the tl;dr is that, given a block-diagonal rotation matrix for $D$-dimensional keys and queries

\[\mathrm{RoPE}_{\boldsymbol{\theta}} = \begin{bmatrix} R_0(\theta_0) & 0 & \cdots & 0 \\ 0 & R_1(\theta_1) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & R_{D/2-1}(\theta_{D/2-1}) \end{bmatrix}\]

we replace the axial RoPE rotations[6]

\[\begin{cases} R_i(\omega_ix) & \text{if } i \text{ is even} \\ R_i(\omega_iy) & \text{if } i \text{ is odd} \end{cases}\]

with a rotation matrix of the form

\[R_i(\omega_i\langle \boldsymbol{u_i}, \boldsymbol{p}\rangle)\]

where

- $\boldsymbol{u_i} \in \mathbb{R}^2$ are unit vectors evenly spaced around the unit circle by an angle of $\Delta u$ (i.e. $\boldsymbol{u_i} = R_{\Delta u}\boldsymbol{u_{i-1}}$)
- $\boldsymbol{p} = (x, y) \in \mathbb{R}^2$ is the position
- $\omega_i$ is the $i$-th frequency magnitude
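To make the definition concrete, here is a minimal sketch of how the per-pair rotation angles could be computed for both variants. The function names (`non_axial_angles`, `axial_angles`), the starting direction $\boldsymbol{u_0} = (1, 0)$, and the unit frequencies in the example are assumptions for illustration; the source does not fix any of them.

```python
import math

def non_axial_angles(p, omegas, delta_u):
    """Return theta_i = omega_i * <u_i, p> for every 2D pair i.

    Assumes u_0 = (1, 0); each following direction is the previous one
    rotated by delta_u, so u_i sits at angle i * delta_u on the unit circle.
    """
    x, y = p
    angles = []
    for i, omega in enumerate(omegas):
        phi = i * delta_u
        u = (math.cos(phi), math.sin(phi))
        angles.append(omega * (u[0] * x + u[1] * y))
    return angles

def axial_angles(p, omegas):
    """Axial RoPE angles: even pairs rotate by omega_i * x, odd pairs by omega_i * y."""
    x, y = p
    return [omega * (x if i % 2 == 0 else y) for i, omega in enumerate(omegas)]

# Hypothetical example: four pairs with unit frequencies at position (2, 3).
p = (2.0, 3.0)
omegas = [1.0] * 4
quarter_turn = non_axial_angles(p, omegas, math.pi / 2)
axial = axial_angles(p, omegas)
print(quarter_turn)  # approximately [2, 3, -2, -3], up to floating-point error
print(axial)         # [2.0, 3.0, 2.0, 3.0]
```

Each angle $\theta_i$ then feeds the corresponding block $R_i(\theta_i)$ of the block-diagonal matrix above.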
Notice how in the visualization below, a $\Delta u$ of $\frac{\pi}{2}$ appears to give us axial RoPE, but this is not really true! The directions cycle through $+x$, $+y$, $-x$, $-y$, and axial RoPE never uses the negative directions.
Since the visualization uses the same random unit vectors for keys and queries before rotation, the $\sin$ cross-term vanishes. Given that, and $\cos$ being an even function, the $(\pm x,0),\;(0,\pm y)$ cycle produces the same scores as $(x,0),\;(0,y)$.
So unlike RoPE in practice, the visualization can ignore the squared norm and the $\sin$ term. This is intentional.
WARNING: unnecessary notation abuse
I.e., if we look at just a single pair $q :=\boldsymbol{Q}_{xy}$ and $k :=\boldsymbol{K}_{xy}$, with $\Delta\theta := \theta_k - \theta_q$, we get:

\[ \begin{aligned} q = k \implies \langle R(\theta_q)q,\ R(\theta_k)k \rangle &= \langle R(\theta_q)q,\ R(\theta_q + \Delta\theta)q \rangle \\[6pt] &= q^\top R(\Delta\theta)\,q \\[6pt] &= \|q\|^2 \cos(\Delta\theta) + \underbrace{(q_x q_y - q_y q_x)}_{\text{vanishes}}\sin(\Delta\theta) \\[6pt] &= \|q\|^2 \cos(\Delta\theta) \\[6pt] &= \cos(\Delta\theta) \qquad \text{since $q$ is a unit vector} \\[6pt] &= \cos(-\Delta\theta) \end{aligned} \]

Play with the spacing to see how the attention scores change. Try $\pi\frac{\sqrt5-1}{2} \approx 1.94161$.
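For readers without the interactive widget, the simplified score can be sketched in a few lines of plain Python. Under the simplification above (same unit vector for $q$ and $k$ in every pair), each pair contributes $\cos(\omega_i\langle \boldsymbol{u_i}, \Delta\boldsymbol{p}\rangle)$. Everything else here is an assumption for illustration: the function names, averaging over pairs, the geometric frequency schedule, $\boldsymbol{u_0} = (1, 0)$, and the `angular_spread` measure of anisotropy.

```python
import math

# The spacing suggested in the text: pi * (sqrt(5) - 1) / 2.
GOLDEN_SPACING = math.pi * (math.sqrt(5) - 1) / 2

def score(delta_p, omegas, delta_u):
    """Mean over pairs of cos(omega_i * <u_i, delta_p>).

    This is the attention score when q = k is a unit vector in every 2D pair,
    normalized by the number of pairs (the normalization is an assumption).
    """
    dx, dy = delta_p
    total = 0.0
    for i, omega in enumerate(omegas):
        phi = i * delta_u  # u_i at angle i * delta_u, assuming u_0 = (1, 0)
        total += math.cos(omega * (math.cos(phi) * dx + math.sin(phi) * dy))
    return total / len(omegas)

def axial_score(delta_p, omegas):
    """Same simplified score for axial RoPE: even pairs see dx, odd pairs see dy."""
    dx, dy = delta_p
    terms = [math.cos(w * (dx if i % 2 == 0 else dy)) for i, w in enumerate(omegas)]
    return sum(terms) / len(terms)

def angular_spread(radius, omegas, delta_u, samples=360):
    """Max minus min of the score on a circle of the given radius.

    A perfectly isotropic score would give 0.
    """
    values = []
    for j in range(samples):
        t = 2 * math.pi * j / samples
        values.append(score((radius * math.cos(t), radius * math.sin(t)), omegas, delta_u))
    return max(values) - min(values)

# Assumed geometric frequency schedule with 16 pairs.
omegas = [1.0 / (100.0 ** (i / 16)) for i in range(16)]
d = (3.0, -1.5)

print(score(d, omegas, math.pi / 2), axial_score(d, omegas))
print(angular_spread(4.0, omegas, math.pi / 2), angular_spread(4.0, omegas, GOLDEN_SPACING))
```

The first line prints two matching numbers, which is the $(\pm x,0),\;(0,\pm y) \to (x,0),\;(0,y)$ simplification in action. The second line compares how much the score varies with direction for the two spacings; the exact values depend on the assumed frequencies and radius.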
[2] Heo et al. 2024 attribute axial RoPE to Fang et al. 2023.
[6] $R_i(\theta)$ is the $2\times2$ rotation matrix acting on dimensions $[2i, 2i+1]$ of the keys and queries.