Multidimensional RoPE

Non-Axial Rotary Positional Embeddings in higher dimensions.

Posted Jul 31, 2025

By nord


I was recently part of some cool discussions in an ML research group chat about extending rotary positional embeddings¹ for transformers to higher dimensions. This post highlights some of the ideas that came out of them. The initial idea was that attention scores of RoPE in 2D should be isotropic, unlike in the widely used Axial RoPE variant², so a non-axial variant for 2D was proposed.

People are calling it Golden RoPE, Uniform RoPE, or IsotRoPE.

Most of the credit goes to jerry, who also wrote a great post about it³.

Kevin Yin wrote a follow-up appendix to jerry’s post, attributing most of the success to incoherent angles instead of angle uniformity. He also expands on finding the optimal angle spacing by formulating it as an energy minimization problem.⁴

nor has a great post on deriving RoPE the proper way with 3 very reasonable constraints⁵:

Relative position dependency: $\langle f(q, p_q), f(k, p_k) \rangle = g(q, k, p_q - p_k)$
Norm preservation: $\|f(q, p_q)\| = \|q\|$
Linearity: $f(q, p_q) = M(p_q)q$
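
To make the three constraints concrete, here is a minimal sketch that checks them numerically for plain 1D RoPE on a single 2D pair, i.e. $f(q, p) = R(\omega p)\,q$. The helper names (`rotate`, `f`, `dot`), the frequency `omega=0.5`, and the test vectors are all made up for illustration and do not come from any of the linked posts.

```python
import math

def rotate(v, angle):
    """Apply the 2x2 rotation matrix R(angle) to a 2-vector v."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def f(v, pos, omega=0.5):
    """1D RoPE on one 2D pair: f(v, p) = R(omega * p) v (omega is illustrative)."""
    return rotate(v, omega * pos)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k = (0.3, -1.2), (0.7, 0.4)

# Relative position dependency: shifting both positions by the same amount
# (here +100) leaves the score unchanged, since only p_q - p_k matters.
score_near = dot(f(q, 5.0), f(k, 2.0))
score_far = dot(f(q, 105.0), f(k, 102.0))

# Norm preservation: rotating q does not change its length.
norm_before = math.hypot(q[0], q[1])
norm_after = math.hypot(*f(q, 5.0))

# Linearity: f(2q + 3k, p) equals 2 f(q, p) + 3 f(k, p).
combo = (2.0 * q[0] + 3.0 * k[0], 2.0 * q[1] + 3.0 * k[1])
lhs = f(combo, 5.0)
fq, fk = f(q, 5.0), f(k, 5.0)
rhs = (2.0 * fq[0] + 3.0 * fk[0], 2.0 * fq[1] + 3.0 * fk[1])

print(abs(score_near - score_far) < 1e-9)
print(abs(norm_before - norm_after) < 1e-12)
print(abs(lhs[0] - rhs[0]) < 1e-12 and abs(lhs[1] - rhs[1]) < 1e-12)
```

All three checks hold up to floating-point error, which is what you would expect from a rotation that depends linearly on position.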

You should read jerry’s blog³ to get the full picture, but the tldr is that given a block-diagonal rotation matrix for $D$-dimensional keys and queries

\[\mathrm{RoPE}_{\boldsymbol{\theta}} = \begin{bmatrix} R_0(\theta_0) & 0 & \cdots & 0 \\ 0 & R_1(\theta_1) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & R_{D/2-1}(\theta_{D/2-1}) \end{bmatrix}\]

we replace the axial rope rotations⁶

\[\begin{cases} R_i(\omega_ix) & \text{if } i \text{ is even} \\ R_i(\omega_iy) & \text{if } i \text{ is odd} \end{cases}\]

with a rotation matrix of the form

\[R_i(\omega_i\langle \boldsymbol{u_i}, \boldsymbol{p}\rangle)\]

where:

$\boldsymbol{u_i} \in \mathbb{R}^2$ are unit vectors evenly spaced around the unit circle by an angle of $\Delta u$ (i.e. $\boldsymbol{u_i} = R_{\Delta u}\boldsymbol{u_{i-1}}$)
$\boldsymbol{p} \in \mathbb{R}^2$ is the position
$\omega_i$ is the $i$-th frequency magnitude
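
The definitions above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: starting the directions at $\boldsymbol{u_0} = (1, 0)$, the frequency values, and all function names (`directions`, `axial_angles`, `non_axial_angles`, `apply_rope`) are my own choices, not something taken from jerry's post.

```python
import math

def directions(n, delta_u):
    """Unit vectors u_i = R(delta_u) u_{i-1}; starting at u_0 = (1, 0) is an assumption."""
    return [(math.cos(i * delta_u), math.sin(i * delta_u)) for i in range(n)]

def axial_angles(p, omegas):
    """Axial RoPE: even pairs rotate by omega_i * x, odd pairs by omega_i * y."""
    return [w * (p[0] if i % 2 == 0 else p[1]) for i, w in enumerate(omegas)]

def non_axial_angles(p, omegas, dirs):
    """Non-axial RoPE: pair i rotates by omega_i * <u_i, p>."""
    return [w * (u[0] * p[0] + u[1] * p[1]) for w, u in zip(omegas, dirs)]

def apply_rope(x, thetas):
    """Apply the block-diagonal matrix: R(theta_i) acts on dimensions [2i, 2i+1] of x."""
    out = []
    for i, theta in enumerate(thetas):
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out.extend([c * a - s * b, s * a + c * b])
    return out

# Illustrative values only: D = 8, so there are D/2 = 4 rotation blocks.
omegas = [1.0, 1.0, 4.0, 4.0]
p = (2.0, 3.0)
q = [0.5, -1.0, 0.25, 2.0, -0.75, 0.1, 1.5, -0.3]

dirs = directions(len(omegas), math.pi / 2)
axial = axial_angles(p, omegas)
non_axial = non_axial_angles(p, omegas, dirs)
q_rot = apply_rope(q, non_axial)

print(axial)
print(non_axial)
```

With $\Delta u = \frac{\pi}{2}$ the two angle lists agree in magnitude, but the non-axial version flips the sign on the third and fourth blocks, because the directions cycle through $(1,0), (0,1), (-1,0), (0,-1)$.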

Notice how in the visualization below, a $\Delta u$ of $\frac{\pi}{2}$ appears to give us axial rope, but this is not really true! We are also cycling over negative $x$ and $y$, which axial rope never does.

Since we’re using the same random unit vectors for keys and queries pre-rotation, the $\sin$ cross-term vanishes. Given that, and $\cos$ being symmetric, our $(\pm x,0),\;(0,\pm y)$ cycle simplifies to $(x,0),\;(0,y)$.

So unlike RoPE in practice, we can ignore the squared norm and the $\sin$ term. This is intentional for the visualization.

WARNING: unnecessary notation abuse

i.e. if we look at just a single pair of $q :=\boldsymbol{Q}_{xy}$ and $k :=\boldsymbol{K}_{xy}$, and write $\Delta\theta := \theta_k - \theta_q$, we get:

\[ \begin{aligned} q = k \implies \langle R(\theta_q)q,\ R(\theta_k)k \rangle &= \langle R(\theta_q)q,\ R(\theta_q + \Delta\theta)q \rangle \\[6pt] &= q^\top R(\Delta\theta)\,q \\[6pt] &= \|q\|^2 \cos(\Delta\theta) + \underbrace{(q_x q_y - q_y q_x)}_{\text{vanishes}}\sin(\Delta\theta) \\[6pt] &= \|q\|^2 \cos(\Delta\theta) \\[6pt] &= \cos(\Delta\theta) \qquad \text{since $q$ is a unit vector} \\[6pt] &= \cos(-\Delta\theta) \end{aligned} \]
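
Under the simplification above, each pair contributes $\cos(\omega_i\langle \boldsymbol{u_i}, \Delta\boldsymbol{p}\rangle)$ to the score, where $\Delta\boldsymbol{p}$ is the offset between key and query positions. The sketch below averages that over all pairs. Averaging (rather than summing), the frequency values, the choice $\boldsymbol{u_0} = (1, 0)$, and the function names are assumptions of mine; the exact formula used by the visualization is not spelled out in this post.

```python
import math

def directions(n, delta_u):
    """Unit vectors u_i = R(delta_u) u_{i-1}; u_0 = (1, 0) is an assumption."""
    return [(math.cos(i * delta_u), math.sin(i * delta_u)) for i in range(n)]

def score(delta_p, omegas, dirs):
    """Mean over pairs of cos(omega_i * <u_i, delta_p>), assuming identical unit q and k."""
    total = 0.0
    for w, u in zip(omegas, dirs):
        total += math.cos(w * (u[0] * delta_p[0] + u[1] * delta_p[1]))
    return total / len(omegas)

def axial_score(delta_p, omegas):
    """Same score for axial RoPE: even pairs see x, odd pairs see y."""
    total = 0.0
    for i, w in enumerate(omegas):
        coord = delta_p[0] if i % 2 == 0 else delta_p[1]
        total += math.cos(w * coord)
    return total / len(omegas)

# Illustrative frequencies and offset.
omegas = [1.0, 1.0, 2.0, 2.0, 4.0, 4.0, 8.0, 8.0]
d = (1.5, -0.7)
neg_d = (-d[0], -d[1])

quarter_turn = directions(len(omegas), math.pi / 2)
golden = directions(len(omegas), math.pi * (math.sqrt(5) - 1) / 2)

at_origin = score((0.0, 0.0), omegas, golden)   # every cosine is 1 at zero offset
forward = score(d, omegas, golden)
backward = score(neg_d, omegas, golden)         # cos is even, so this matches forward
quarter = score(d, omegas, quarter_turn)
axial = axial_score(d, omegas)                  # matches quarter once sin is dropped

print(at_origin, forward, backward, quarter, axial)
```

In this simplified setting the $\Delta u = \frac{\pi}{2}$ score coincides with the axial one, which is why the $(\pm x,0),\;(0,\pm y)$ cycle collapses to $(x,0),\;(0,y)$ in the visualization but not in a real model where the $\sin$ term survives.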

Play with the spacing to see how the attention scores change. Try $\pi\frac{\sqrt5-1}{2} \approx$ 1.9416⁴.

[Interactive visualization of the attention scores. Controls: Resolution, Num Frequencies, Min Freq., Max Freq., Direction Spacing (Δu), Center X, Center Y, Animation Speed.]

¹ RoPE paper
² Heo et al. 2024 attribute axial rope to Fang et al. 2023.
³ jerryxiong’s post
⁴ Kevin Yin’s post
⁵ nor’s post
⁶ $R_i(\theta)$ is the $2\times2$ rotation matrix acting on dimensions $[2i, 2i+1]$ of the keys and queries.

Machine Learning, Transformers

This post is licensed under CC BY 4.0 by the author.
