Simple math to speed up GDN prefill
In this short note we briefly derive a helpful identity to speed up the GDN prefill algorithm. In my simple Torch implementation of the algorithm it already gave a good speedup of about 18%. I assume that for a custom CUDA C++ kernel the gains can be even more pronounced.
Please read my previous post on GDN for background.
Reminder
The state transition for GDN is as follows:

$$S_t = S_{t-1}\,\alpha_t\!\left(I - \beta_t\,k_t k_t^\top\right) + \beta_t\,v_t k_t^\top.$$

Define the cumulative gate

$$b_t = \prod_{s=1}^{t} \alpha_s$$

and for a chunk of length $C$ stack the keys and values row-wise into $K \in \mathbb{R}^{C \times d}$ and $V \in \mathbb{R}^{C \times d_v}$.
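As a sanity reference, the recurrence can be sketched directly in Torch. This is a minimal sketch in my own notation: the function name, the shapes, and the convention of storing the state as a $d_v \times d_k$ matrix are assumptions, not the note's implementation.

```python
import torch

def gdn_recurrent_step(S, k, v, alpha, beta):
    # One step of the recurrence above (assumed shapes):
    #   S: (d_v, d_k) state, k: (d_k,) key, v: (d_v,) value,
    #   alpha: scalar gate, beta: scalar learning rate.
    # S_t = S_{t-1} alpha_t (I - beta_t k_t k_t^T) + beta_t v_t k_t^T
    d_k = k.shape[0]
    I = torch.eye(d_k, dtype=S.dtype)
    return S @ (alpha * (I - beta * torch.outer(k, k))) + beta * torch.outer(v, k)
```

With $\beta_t = 0$ the step reduces to pure decay, $S_t = \alpha_t S_{t-1}$, which is a quick way to sanity-check such an implementation.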
We obtained a chunkwise transition rule for the state (see my previous post for the full derivation). Up to a decay factor, it involves the two auxiliary matrices

$$W = \left(I - A'\right)^{-1}\,\mathrm{diag}(\beta)\,K$$

and

$$U = \left(I - A\right)^{-1}\,\mathrm{diag}(\beta)\,V,$$

where $A$ and $A'$ are strictly lower triangular matrices spelled out below.
We see that there are two matrix inversions involved in these computations. Inversion is potentially one of the bottlenecks when computing the chunkwise transition, so it would be helpful if we could save one of them.
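Concretely, both matrices to invert are unit lower triangular, so each inverse amounts to a triangular solve against the identity. A minimal sketch of the naive two-solve path in my own notation (`A` and `A_gated` stand for the two matrices, keys L2-normalized as is common for DeltaNet-style models; sizes and distributions are assumptions):

```python
import torch

torch.manual_seed(0)
C, d = 16, 32                                  # chunk length, head dimension
K = torch.nn.functional.normalize(torch.randn(C, d, dtype=torch.float64), dim=-1)
beta = torch.rand(C, dtype=torch.float64)
alpha = 0.9 + 0.1 * torch.rand(C, dtype=torch.float64)   # gates in (0.9, 1.0)
b = torch.cumprod(alpha, dim=0)                # cumulative gate b_t

# Strictly lower triangular interaction matrices (i > j):
#   A_ij  = beta_i <k_i, k_j>
#   A'_ij = beta_i (b_i / b_j) <k_i, k_j>
A = torch.tril(beta[:, None] * (K @ K.T), diagonal=-1)
A_gated = (b[:, None] / b[None, :]) * A

# The naive path: two separate triangular solves against the identity.
I = torch.eye(C, dtype=torch.float64)
T = torch.linalg.solve_triangular(I - A, I, upper=False)              # (I - A)^{-1}
T_gated = torch.linalg.solve_triangular(I - A_gated, I, upper=False)  # (I - A')^{-1}
```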
Save one invert
Consider

$$T = \left(I - A\right)^{-1}, \qquad A = \mathrm{tril}\!\left(\mathrm{diag}(\beta)\,K K^\top,\,-1\right)$$

and

$$T' = \left(I - A'\right)^{-1}, \qquad A' = \mathrm{tril}\!\left(\mathrm{diag}(\beta)\,\mathrm{diag}(b)\,K K^\top\,\mathrm{diag}(b)^{-1},\,-1\right).$$

Let $b_t = \prod_{s=1}^{t} \alpha_s$ and denote by $k_t$ the $t$-th row of $K$. Then

$$\left(K K^\top\right)_{ij} = k_i^\top k_j.$$

Since $\mathrm{diag}(\beta)$ is diagonal, we write

$$\left(\mathrm{diag}(\beta)\,K K^\top\right)_{ij} = \beta_i\,k_i^\top k_j.$$

For $i > j$ the strict lower triangular part of $A$ therefore has entries

$$A_{ij} = \beta_i\,k_i^\top k_j.$$

Similarly, for $A'$ we obtain

$$A'_{ij} = \beta_i\,\frac{b_i}{b_j}\,k_i^\top k_j.$$

These already look highly similar; we will make this explicit below.

Plugging in the definition of $b_t$ we obtain

$$\frac{b_i}{b_j} = \prod_{s=j+1}^{i} \alpha_s \qquad (i > j),$$

and therefore

$$A'_{ij} = \Big(\prod_{s=j+1}^{i} \alpha_s\Big)\,A_{ij}.$$
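The telescoping of the cumulative gate is easy to check numerically (indices are 0-based in code, while the text counts from 1):

```python
import torch

torch.manual_seed(0)
alpha = 0.5 + 0.5 * torch.rand(8, dtype=torch.float64)
b = torch.cumprod(alpha, dim=0)              # b_t = prod_{s<=t} alpha_s
i, j = 6, 2                                  # any pair with i > j
ratio = b[i] / b[j]                          # b_i / b_j
explicit = torch.prod(alpha[j + 1 : i + 1])  # prod_{s=j+1}^{i} alpha_s
print(bool(torch.allclose(ratio, explicit)))  # True
```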
Let us factor out the gate factors. Define

$$\Gamma = \mathrm{diag}(b_1, \dots, b_C).$$

For any matrix $M \in \mathbb{R}^{C \times C}$ we have

$$\left(\Gamma M \Gamma^{-1}\right)_{ij} = \frac{b_i}{b_j}\,M_{ij}.$$

Thus multiplying a matrix by $\Gamma$ on the left and by $\Gamma^{-1}$ on the right multiplies the $(i,j)$ entry by $b_i / b_j$.

Applying this observation to $A$ yields

$$A' = \Gamma A \Gamma^{-1}.$$

Therefore we can write

$$I - A' = I - \Gamma A \Gamma^{-1}.$$

Since $\Gamma\,I\,\Gamma^{-1} = I$, we obtain

$$I - A' = \Gamma \left(I - A\right) \Gamma^{-1}.$$

Taking the inverse gives

$$\left(I - A'\right)^{-1} = \Gamma \left(I - A\right)^{-1} \Gamma^{-1}.$$

Thus the factor $\left(I - A'\right)^{-1}$ appearing in the chunkwise computation can be expressed using the same inverse $\left(I - A\right)^{-1}$ together with two diagonal matrix multiplications. This allows us to reuse the already computed inverse and avoid performing a second matrix inversion.
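The identity is easy to verify numerically. Note also that since conjugation by $\Gamma$ only rescales entry $(i,j)$ by $b_i/b_j$, the two diagonal multiplications collapse into a single elementwise product. A sketch in my own notation (matrix sizes and distributions are assumptions):

```python
import torch

torch.manual_seed(0)
C, d = 16, 32
K = torch.nn.functional.normalize(torch.randn(C, d, dtype=torch.float64), dim=-1)
beta = torch.rand(C, dtype=torch.float64)
alpha = 0.9 + 0.1 * torch.rand(C, dtype=torch.float64)
b = torch.cumprod(alpha, dim=0)

A = torch.tril(beta[:, None] * (K @ K.T), diagonal=-1)  # A_ij = beta_i <k_i, k_j>
ratio = b[:, None] / b[None, :]                         # (i, j) entry: b_i / b_j
A_gated = ratio * A                                     # A' = Gamma A Gamma^{-1}

I = torch.eye(C, dtype=torch.float64)
direct = torch.linalg.inv(I - A_gated)                  # second inversion (to be avoided)
reused = ratio * torch.linalg.inv(I - A)                # Gamma (I - A)^{-1} Gamma^{-1}

print(bool(torch.allclose(direct, reused, atol=1e-8)))  # True
```

In a prefill kernel this means computing $(I - A)^{-1}$ once and obtaining $(I - A')^{-1}$ with a single elementwise multiply, which is exactly the saving this note is about.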
Conclusion
In this short note we have seen how simple math can speed up parallel algorithms in deep learning. Sometimes it pays off to look carefully at the equations and bring them into their simplest form.
The observation of how to "save the invert" was first made in Comba. Please take a look at their paper for an alternative derivation.