Position Embeddings: Sinusoidal, RoPE

Position embeddings are important in NLP (transformer), CV (diffusion), and others. The Sinusoidal embedding introduced in transformer is quite mysterious and I hope to understand it better. I partly follow derivations in RoPE.

In transformer, say there are $V$ words in total, and $w_i\in \mathbb{R}^D$ is the embedding for word $i$. An input of shape $B\times T$ (batch and length) becomes $B\times T\times D$ after embedding and is then fed into attention. Now there’s nothing special about words’ positions, but this info is too important to ignore.

Suppose $D=2d$, we have an absolute position embedding $\phi_t(x)\in \mathbb{C}^d$ for position $t$, and $\phi_t$ is shared by both queries and keys. We hope the inner products of embeddings should contain relative positional infomation.

Given a single query and key $\mathbf{q},\mathbf{k}\in \mathbb{R}^D$, their embeddings should satisfy

\[\langle \phi_t(\mathbf{q}), \overline{\phi_s(\mathbf{k})}\rangle = \psi(\mathbf{q},\mathbf{k},t-s)\]

for some $\psi$.

Naturally, we assume $\phi_0(z)=z.$ Then

\[\langle \phi_t(z), \overline{\phi_t(w)}\rangle =\psi(z,w,0)=\langle \phi(z,0), \overline{\phi(w,0)}\rangle=\langle z,\bar w\rangle.\]

$\phi_t$ is isometry on $\mathbb{C}^n$, and thus $\phi_t(z)=U_tz$ for some unitary matrix $U_t$ (ignoring translation).

\[U_{t-1}^*U_t=\sum_{i,j}\langle U_t e_j,\overline{U_{t-1}e_i}\rangle e_ie_j^\top = \sum_{i,j}\psi(e_j,e_i,1) e_ie_j^\top =:\Theta,\]

which is independent of $t$. So,

\[U_t = U_0^*U_t = (U_0^*U_1)(U_1^*U_2)\cdots(U_{t-1}^*U_t) = \Theta^t.\]

We are free to choose $\Theta$ as long as it’s unitary.

RoPE uses diagonal $U_t.$ Similar to Sinusoidal, RoPE considers angles of the form $\theta^{k/d}$:

\[\Theta = {\rm diag}(\mathrm{e}^{\mathrm{i}\theta^{0/d}}, \mathrm{e}^{\mathrm{i}\theta^{1/d}},\cdots,\mathrm{e}^{\mathrm{i}\theta^{(d-1)/d}}),\quad \phi_t(z)=\Theta^tz\]

where $\theta=10^{-4}$ or $1/T$ in general, where $T$ is the max horizon/timestep.

For Sinusoidal, $0\le j< d=\frac{D}{2},0\le t<T,$

\[{\rm PE}(t)_{2j} = \sin(t \theta^{\frac{j}{d}}),\quad {\rm PE}(t)_{2j+1} = \cos(t \theta^{\frac{j}{d}}).\]

In complex form,

\[\widetilde{\mathrm{PE}}(t) = \left[\mathrm{e}^{\mathrm{i}t\theta^{0/d}},\cdots, \mathrm{e}^{\mathrm{i}t\theta^{(d-1)/d}}\right] ={\rm diag}(\Theta^t).\]

The difference is that in transformer, inputs of attention are token embeddings + position embeddings, while RoPE multiplies $\Theta^t$ with queries and keys.

Decay estimates

Why $\theta^{j/d}?$ We show in the following that the embedding corresponds to oscillating integrals creating plenty of inhomogeneity. Also, tokens far away are not too relevant:

$ \langle \phi_{t_0+t}(\mathbf{q}), \overline{\phi_{t_0}(\mathbf{k})}\rangle\to 0$ as $t\to\infty$.

I modified the derivations of RoPE slightly.

\[\begin{align*} \langle \phi_t(\mathbf{q}), \overline{\phi_0(\mathbf{k})}\rangle & = \sum_{j=0}^{d-1} \mathrm{e}^{\mathrm{i}t\theta^{j/d}}q_j\bar k_j \approx \int_0^1 \mathrm{e}^{\mathrm{i}t\theta^{s}}f(s)\,\mathrm{d}s = - \frac{1}{\log\theta}\int_\theta^{1} \mathrm{e}^{\mathrm{i}tu}f(u)\frac{\mathrm{d}u}{u} \end{align*}\]

for some function $f(s)$, where $u=\theta^s.$ Let

\[\mathrm{Ei}(z) := -\int_{-z}^\infty \frac{\mathrm{e}^{-s}}{s}\,\mathrm{d}s,\quad E(u):=\mathrm{Ei}(\mathrm{i}tu).\]

Then $E'(u)=\mathrm{e}^{-\mathrm{i}tu}/u$ and

\[\begin{align*} \int_\theta^1\mathrm{e}^{\mathrm{i}tu}f(u)\frac{\mathrm{d}u}{u} =\int_\theta^1 E'(u)f(u)\,\mathrm{d}u = fE\Big|_{\theta}^1 - \int_\theta^1 f'E. \end{align*}\]

We look at Ei and consider the so-called Poincaré expansion by integrating by parts repeatedly:

\[\begin{align*} \mathrm{Ei}(z) &:= -\int_{-z}^\infty \frac{\mathrm{e}^{-s}}{s}\,\mathrm{d}s = \frac{\mathrm{e}^{z}}{z} + \int_{-z}^\infty \frac{\mathrm{e}^{-s}}{s^2}\,\mathrm{d}s = \cdots =\frac{\mathrm{e}^{z}}{z}\sum_{k\ge 0} \frac{k!}{z^k}. \end{align*}\]

The convergence is uniform over $\Omega_\delta:=\{z: | \arg(z) | \le \pi-\delta\}$ (as ChatGPT claimed).

In particular, as $\mathrm{i}tu\in \Omega_{\pi/2},$

\[E(u)=\mathrm{Ei}(\mathrm{i}tu) \approx \frac{\mathrm{e}^{\mathrm{i}tu}}{\mathrm{i}tu}\to 0\]

uniformly for $u\in[\theta,1]$ and the decay order is $O(1/t).$

Thus, if $\mathbf{q},\mathbf{k}$ are bounded,

\[\langle \phi_t(\mathbf{q}), \overline{\phi_0(\mathbf{k})}\rangle = O(1/t)\]

as $t\to \infty$.