
Simple math to speed up GDN prefill

In this short note we will briefly derive a helpful identity to speed up the GDN prefill algorithm. In my simple Torch implementation of the algorithm this already gave a good speedup of about 18%. I assume that for a custom CUDA C++ kernel the gains can be even more pronounced.

Please read my previous post on GDN for background.

Reminder

The state transition for GDN is as follows:

$$S_t = S_{t-1}\left(\alpha_t \left(I - \beta_t k_t k_t^T\right)\right) + \beta_t v_t k_t^T \in \mathbb{R}^{d_v \times d_k}.$$
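As a concrete reference, here is a minimal Torch sketch of one step of this recurrence; all names and shapes are illustrative assumptions, not taken from any particular implementation:

```python
import torch

d_k, d_v = 8, 4
S = torch.zeros(d_v, d_k)                       # state S_{t-1}

# assumed per-token quantities: scalar gates and key/value vectors
alpha_t, beta_t = 0.9, 0.5
k_t, v_t = torch.randn(d_k), torch.randn(d_v)

# S_t = S_{t-1} (alpha_t (I - beta_t k_t k_t^T)) + beta_t v_t k_t^T
I = torch.eye(d_k)
S = S @ (alpha_t * (I - beta_t * torch.outer(k_t, k_t))) \
    + beta_t * torch.outer(v_t, k_t)
```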

Define the cumulative gate

$$\gamma_{[t]}^{r} = \prod_{j=1}^{r} \alpha_{[t]}^{j}$$

and for $1 \le i \le r$

$$\Gamma_{[t]}^{r,i} = \frac{\gamma_{[t]}^{r}}{\gamma_{[t]}^{i}} = \prod_{j=i+1}^{r} \alpha_{[t]}^{j}.$$
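In code, the cumulative gates of a chunk are just a cumulative product. A minimal sketch, assuming a chunk length `C` and per-token gates `alpha`:

```python
import torch

C = 64                                    # chunk length (assumed)
alpha = torch.rand(C)                     # per-token gates alpha_[t]^j

gamma = torch.cumprod(alpha, dim=0)       # gamma_[t]^r = prod_{j<=r} alpha_[t]^j
Gamma = gamma[:, None] / gamma[None, :]   # Gamma_[t]^{r,i} = gamma_r / gamma_i
```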

We obtained the following chunkwise transition rule

$$S_{[t+1]} = S_{[t]} + \left(\tilde{U}_{[t]} - W_{[t]} S_{[t]}^T\right)^T K_{[t]} \in \mathbb{R}^{d_v \times d_k}.$$

We had (up to a decay factor)

$$W_{[t]} = \left(I + \mathrm{tril}\left(B_{[t]} K_{[t]} K_{[t]}^T,\, -1\right)\right)^{-1} B_{[t]} K_{[t]}$$

and

$$\tilde{U}_{[t]} = \left(I + \tilde{L}_{[t]}\right)^{-1} B_{[t]} V_{[t]} = \left(I + \mathrm{tril}\left(B_{[t]}\left(\Gamma_{[t]} \odot K_{[t]} K_{[t]}^T\right),\, -1\right)\right)^{-1} B_{[t]} V_{[t]} \in \mathbb{R}^{C \times d_v}.$$
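Putting these together, a naive Torch sketch of the chunkwise quantities might look as follows, ignoring the decay factor mentioned above and assuming shapes `K: (C, d_k)`, `V: (C, d_v)`, `beta: (C,)` and cumulative gates `gamma: (C,)`:

```python
import torch

def chunk_w_u_naive(K, V, beta, gamma):
    """Compute W_[t] and U~_[t] with two matrix inverses (the baseline)."""
    C = K.shape[0]
    I = torch.eye(C, dtype=K.dtype, device=K.device)
    BKKt = beta[:, None] * (K @ K.T)          # B_[t] K_[t] K_[t]^T
    Gamma = gamma[:, None] / gamma[None, :]   # cumulative-gate ratios

    M = I + torch.tril(BKKt, diagonal=-1)
    N = I + torch.tril(Gamma * BKKt, diagonal=-1)

    W = torch.linalg.inv(M) @ (beta[:, None] * K)   # first inverse
    U = torch.linalg.inv(N) @ (beta[:, None] * V)   # second inverse
    return W, U
```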

We see that there are two matrix inverses involved in these computations. Inversion is potentially one of the bottlenecks in the computation of the chunkwise transition, so it would be helpful if we could save one of them.

Save one inverse

Consider

$$M = I + \mathrm{tril}\left(B_{[t]} K_{[t]} K_{[t]}^T,\, -1\right)$$

and

$$N = I + \mathrm{tril}\left(B_{[t]}\left(\Gamma_{[t]} \odot K_{[t]} K_{[t]}^T\right),\, -1\right).$$

Let $K_{[t]} \in \mathbb{R}^{C \times d_k}$ and denote by $k_\mu$ the $\mu$-th row of $K_{[t]}$. Then

$$\left(K_{[t]} K_{[t]}^T\right)_{\mu\nu} = k_\mu^T k_\nu.$$

Since $B_{[t]}$ is diagonal, we write $B_{[t]} = \mathrm{diag}(\beta_1, \ldots, \beta_C)$.
For $\mu > \nu$ the strict lower triangular part of $M$ therefore has entries

$$M_{\mu\nu} = \beta_\mu\, k_\mu^T k_\nu, \qquad M_{\mu\mu} = 1.$$

Similarly, for $N$ we obtain

$$N_{\mu\nu} = \beta_\mu\, \Gamma_{[t]}^{\mu,\nu}\, k_\mu^T k_\nu, \qquad N_{\mu\mu} = 1.$$
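These entrywise formulas are easy to sanity-check numerically. A small sketch with random data and assumed sizes:

```python
import torch

C, d_k = 4, 8
K = torch.randn(C, d_k, dtype=torch.float64)
beta = torch.rand(C, dtype=torch.float64)
gamma = torch.cumprod(torch.rand(C, dtype=torch.float64), dim=0)

I = torch.eye(C, dtype=torch.float64)
BKKt = beta[:, None] * (K @ K.T)
Gamma = gamma[:, None] / gamma[None, :]

M = I + torch.tril(BKKt, diagonal=-1)
N = I + torch.tril(Gamma * BKKt, diagonal=-1)

M_ref, N_ref = I.clone(), I.clone()
for mu in range(C):
    for nu in range(mu):                       # mu > nu: strict lower triangle
        M_ref[mu, nu] = beta[mu] * (K[mu] @ K[nu])
        N_ref[mu, nu] = beta[mu] * Gamma[mu, nu] * (K[mu] @ K[nu])

assert torch.allclose(M, M_ref) and torch.allclose(N, N_ref)
```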

We see that these already look highly similar; we will make this more explicit below.

Plugging in the definition of $\Gamma$ we obtain

$$\Gamma_{[t]}^{\mu,\nu} = \frac{\gamma_{[t]}^{\mu}}{\gamma_{[t]}^{\nu}}$$

and therefore

$$N_{\mu\nu} = \beta_\mu\, \frac{\gamma_{[t]}^{\mu}}{\gamma_{[t]}^{\nu}}\, k_\mu^T k_\nu.$$

Let us factor out the $\gamma$ factors. Define

$$G = \mathrm{diag}\left(\gamma_{[t]}^{1}, \ldots, \gamma_{[t]}^{C}\right), \qquad G^{-1} = \mathrm{diag}\left(\left(\gamma_{[t]}^{1}\right)^{-1}, \ldots, \left(\gamma_{[t]}^{C}\right)^{-1}\right).$$

For any matrix $A$ we have

$$\left(G A G^{-1}\right)_{\mu\nu} = \gamma_{[t]}^{\mu}\, A_{\mu\nu}\, \left(\gamma_{[t]}^{\nu}\right)^{-1}.$$

Thus multiplying a matrix by $G$ on the left and by $G^{-1}$ on the right multiplies the $(\mu,\nu)$ entry by $\gamma_{[t]}^{\mu} \left(\gamma_{[t]}^{\nu}\right)^{-1}$.
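This scaling identity can be confirmed in one line; a quick sketch with random data:

```python
import torch

C = 5
A = torch.randn(C, C, dtype=torch.float64)
gamma = torch.rand(C, dtype=torch.float64)

G, G_inv = torch.diag(gamma), torch.diag(1.0 / gamma)

# entry (mu, nu) of G A G^{-1} is gamma_mu * A_{mu,nu} / gamma_nu
assert torch.allclose(G @ A @ G_inv, gamma[:, None] * A / gamma[None, :])
```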

Applying this observation to $A = \mathrm{tril}\left(B_{[t]} K_{[t]} K_{[t]}^T,\, -1\right)$, and noting that conjugation by a diagonal matrix preserves the strictly lower triangular zero pattern, yields

$$\mathrm{tril}\left(B_{[t]}\left(\Gamma_{[t]} \odot K_{[t]} K_{[t]}^T\right),\, -1\right) = G\, \mathrm{tril}\left(B_{[t]} K_{[t]} K_{[t]}^T,\, -1\right) G^{-1}.$$

Therefore we can write

$$N = I + G\, \mathrm{tril}\left(B_{[t]} K_{[t]} K_{[t]}^T,\, -1\right) G^{-1}.$$

Since $I = G I G^{-1}$, we obtain

$$N = G\left(I + \mathrm{tril}\left(B_{[t]} K_{[t]} K_{[t]}^T,\, -1\right)\right) G^{-1} = G M G^{-1}.$$

Taking the inverse gives

$$N^{-1} = \left(G M G^{-1}\right)^{-1} = G\, M^{-1} G^{-1}.$$
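Both identities are easy to verify numerically. A sketch with random data and assumed sizes:

```python
import torch

torch.manual_seed(0)
C, d_k = 6, 8
K = torch.randn(C, d_k, dtype=torch.float64)
beta = torch.rand(C, dtype=torch.float64)
gamma = torch.cumprod(torch.rand(C, dtype=torch.float64), dim=0)

I = torch.eye(C, dtype=torch.float64)
BKKt = beta[:, None] * (K @ K.T)
Gamma = gamma[:, None] / gamma[None, :]

M = I + torch.tril(BKKt, diagonal=-1)
N = I + torch.tril(Gamma * BKKt, diagonal=-1)
G, G_inv = torch.diag(gamma), torch.diag(1.0 / gamma)

assert torch.allclose(N, G @ M @ G_inv)                    # N = G M G^{-1}
assert torch.allclose(torch.linalg.inv(N),
                      G @ torch.linalg.inv(M) @ G_inv)     # N^{-1} = G M^{-1} G^{-1}
```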

Thus the inverse $N^{-1}$ appearing in $\tilde{U}_{[t]}$ can be expressed using the same inverse $M^{-1}$ together with two diagonal matrix multiplications. This allows us to reuse the computed inverse and avoid a second matrix inversion.
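In code this means both chunkwise quantities can be computed from a single inverse, with `G` and `G^{-1}` acting as cheap elementwise row scalings. A sketch mirroring the naive version above (same assumed shapes):

```python
import torch

def chunk_w_u_one_inverse(K, V, beta, gamma):
    """Compute W_[t] and U~_[t] reusing one inverse via N^{-1} = G M^{-1} G^{-1}."""
    C = K.shape[0]
    I = torch.eye(C, dtype=K.dtype, device=K.device)
    M = I + torch.tril(beta[:, None] * (K @ K.T), diagonal=-1)
    M_inv = torch.linalg.inv(M)                 # the only matrix inverse

    W = M_inv @ (beta[:, None] * K)
    # U~ = N^{-1} B V = G (M^{-1} (G^{-1} B V)); G, G^{-1} are row scalings
    U = gamma[:, None] * (M_inv @ ((beta / gamma)[:, None] * V))
    return W, U
```

Compared to the naive version, the second inverse is replaced by two elementwise row scalings, which is where the speedup comes from.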

Conclusion

In this short note we have seen how a bit of simple math can speed up parallel algorithms in deep learning. Sometimes it pays to look carefully at the equations and bring them into their simplest form.

The observation of how to "save the inverse" was first made in Comba. Please take a look at their paper for an alternative derivation.