Hello everyone.
I have been catching up on the recent literature on long context. I am reading kaiokendev's post and about NTK scaling. One question that doesn't seem to be answered by those posts or any others: isn't attention quadratic? I still recall the O(L^2) complexity that is referred to countless times in the older literature (e.g. Longformer or BigBird). (I read somewhere that in practice it is actually the MLP that consumes the most, but still, the longer the sequence, the more memory it takes.)
Based on my understanding of the current work on long context, they are tweaking the frequency or base in the rotary embedding, which lets the model interpolate to unseen sequence lengths. But you still need much more memory for an 8k input than for a 2k one. Is this issue solved? I'd appreciate any pointers. Thanks!
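To make concrete what I mean by "tweaking the frequency or base", here is a rough sketch of how I understand the two tricks. The function names, the head dimension of 128, and the 4x scale factor are my own assumptions for illustration, not code from either post.

```python
def rope_angles(position, dim, base=10000.0):
    """Rotary angles for one position: position * base**(-2i/dim)."""
    return [position * base ** (-2.0 * i / dim) for i in range(dim // 2)]


def interpolated_angles(position, dim, scale, base=10000.0):
    """Linear position interpolation: squeeze positions by 1/scale."""
    return rope_angles(position / scale, dim, base)


def ntk_angles(position, dim, scale, base=10000.0):
    """NTK-style scaling: enlarge the base instead of shrinking positions."""
    new_base = base * scale ** (dim / (dim - 2))
    return rope_angles(position, dim, new_base)


# Assumed setup: head dimension 128, stretching a 2k-trained model to 8k.
dim, scale = 128, 4.0
# Interpolation maps position 8192 onto the angles the model saw at 2048.
same_as_2k = interpolated_angles(8192, dim, scale) == rope_angles(2048, dim)
```

Either way, only the positional angles change. The number of tokens that attend to each other stays the same, which is why I don't see how this helps with memory.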
While there are some techniques for sub-quadratic transformer scaling (Longformer, LongNet, Retentive Network, and others), no state-of-the-art models have been trained with them, and they cannot be applied to existing models, so they aren't really being used. As for why new models don't adopt them: either they harm performance too much, or they are considered too risky and more proven architectures are used instead. If there were good models trained with these techniques, people would use them, but for now there aren't any, so we are stuck with O(n^2) computation and memory.
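To put a rough number on the O(n^2) part, here is a back-of-the-envelope sketch. It assumes the full score matrix is materialized for a single layer at batch size 1, with 32 heads and fp16 values; those figures are illustrative assumptions, not measurements from any particular model.

```python
def attention_score_bytes(seq_len, num_heads, bytes_per_element=2):
    """Bytes for one layer's seq_len x seq_len score matrix across all heads."""
    return num_heads * seq_len * seq_len * bytes_per_element


# Assumed setup: 32 heads, fp16 (2 bytes per value).
mem_2k = attention_score_bytes(2048, 32)
mem_8k = attention_score_bytes(8192, 32)

print(f"2k tokens: {mem_2k / 2**20:.0f} MiB")
print(f"8k tokens: {mem_8k / 2**20:.0f} MiB")
print(f"ratio: {mem_8k // mem_2k}x")
```

Under these assumptions, going from 2k to 8k is a 4x longer sequence but a 16x larger score matrix, and changing the rotary embedding does nothing to that term.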