LocalLLaMA


Regarding long context and quadratic attention (alien.top)

submitted 2 years ago by tt19234@alien.top to c/localllama@poweruser.forum


Hello everyone.

I have been catching up on the recent literature on long context, and I am reading kaiokendev's post and the NTK scaling work. One question that I have, and that does not seem to be answered by their posts or any others, is: isn't attention quadratic? I still recall the O(L^2) complexity that is referred to countless times in the older literature (e.g. Longformer or BigBird). (I read somewhere that in practice it is actually the MLP that consumes the most, but still, the longer the sequence is, the more memory it takes.)

Based on my understanding, the current work on long context tweaks the frequency or base in the rotary embedding, which lets the model interpolate to unseen sequence lengths. But you still need much more memory for an 8k input than for a 2k one. Is this issue solved? I appreciate any pointers. Thanks!
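To make the two tweaks concrete, here is a minimal sketch in plain Python of linear position interpolation and NTK-aware base scaling as they are usually described. The head size, the context lengths, and the helper names `rope_inv_freq` and `angles` are assumptions chosen for illustration, not values or code from any particular model.

```python
import math

def rope_inv_freq(head_dim, base=10000.0):
    """Per-pair rotation frequencies used by rotary embeddings."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def angles(position, inv_freq, position_scale=1.0):
    """Rotation angle of each frequency pair at a given token position."""
    return [position * position_scale * f for f in inv_freq]

head_dim = 128        # assumed head size
trained_ctx = 2048    # assumed training context
target_ctx = 8192     # assumed desired context
factor = target_ctx / trained_ctx

plain = rope_inv_freq(head_dim)

# Linear interpolation: squeeze positions back into the trained range.
interp_angles = angles(target_ctx - 1, plain, position_scale=1.0 / factor)

# NTK-aware scaling: keep positions as they are and enlarge the base instead,
# so the fastest frequency is untouched and the slowest is divided by `factor`.
ntk_base = 10000.0 * factor ** (head_dim / (head_dim - 2))
ntk = rope_inv_freq(head_dim, base=ntk_base)

print("fastest frequency, plain vs NTK:", plain[0], ntk[0])
print("slowest frequency, plain vs NTK:", plain[-1], ntk[-1])
```

Both tweaks only change the angles fed into the rotation. Neither touches the size of the attention score matrix, so on their own they do nothing about the memory cost of a longer input.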


Someone13574@alien.top 1 points 2 years ago

While there are some techniques for sub-quadratic transformer scaling (Longformer, LongNet, Retentive Network, and others), there haven't really been any state-of-the-art models trained with them, and they cannot be applied to existing models, so they aren't really being used. As for why they aren't being used in new models, that is either because they harm performance too much or because they are considered too risky, so more proven architectures are used instead. If there were good models trained with these techniques, people would use them, but for now there just aren't any, so we are stuck with O(n^2) computation and memory.

Aaaaaaaaaeeeee@alien.top 1 points 2 years ago

Yes, but most people use flash attention (https://hazyresearch.stanford.edu/blog/2023-07-17-flash2), which makes the memory requirement linear in sequence length for inference.
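A rough sketch of why the memory can be linear: the softmax can be computed as a running sum over blocks of keys, so the full row of scores never has to be stored at once. The code below shows only that running-softmax idea for a single query in plain Python. It is not the FlashAttention kernel, and the function names and the block size are made up for illustration.

```python
import math

def attention_naive(q, keys, values):
    """Reference version: builds the full list of scores for one query."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def attention_streaming(q, keys, values, block=2):
    """Same result, visiting keys and values one block at a time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.5, -0.3], [0.0, 2.0], [-1.0, 0.4], [0.7, 0.7]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
print(attention_naive(q, keys, values))
print(attention_streaming(q, keys, values))
```

The number of multiplications is the same in both versions, so compute is still quadratic in sequence length. Only the extra memory per query drops from one score per key to one score per key in the current block.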

CPU inference would benefit the most from switching to a new architecture that does not require it, since a CPU has much less compute available for flash attention. As far as I know, ggml does not currently use flash attention.

Aphid_red@alien.top 1 points 2 years ago

I've been doing some off-hand calculations about this and found that really, most base models look like they've been made with a context that is shorter than really needed. Quadratic scaling of attention is a problem, but not something any of the currently trained models have actually encountered, as their sizes are too small for the quadratic factor to be relevant.

Let c = ctx_len and d = hidden_dimension. The network's size in parameters doesn't depend much on ctx_len at all, but does depend a lot on d (quadratically)!

How much memory besides the parameters do we need to run the computation? Well, there's the matrix QK^T, which has shape c*c and is the only truly quadratic step in attention. There's also c*3d for making the matrices Q, K, and V.

Next, there is the 'linear' step. This involves a matrix of size c*4d.

Take c = d. The first number ends up as d^2 + 3d^2 = 4d^2. The second is also 4d^2.

So it doesn't make much sense to take c < d: the 'attention' part of the transformer (quadratic in c) will be smaller, in memory and compute, than the 'feedforward' part (which is linear in c and in d). The attention part also includes a term that scales as c*3d, which comes from creating Q, K, and V (the inputs to attention) out of X (the (masked) input being self-attended).

For example, Llama has d=8192. So it seems c>=8192 is optimal (in terms of making attention and feedforward take roughly equal time and memory), but c=4096 was actually chosen.

For Falcon, d=14848. So c>=14848 would make sense, but c=2048 was chosen.
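The counts above are easy to check with a few lines of Python. This sketch follows the same simplified counting as the estimate here (one c*c score matrix, c*3d for Q, K, and V, and a c*4d feedforward activation), and the function names are made up for illustration. It ignores heads, layers, and bytes per element, so treat the output as a rough ratio rather than a real memory figure.

```python
def attention_elements(c, d):
    """Score matrix QK^T (c*c) plus Q, K, and V (c*3d)."""
    return c * c + 3 * c * d

def feedforward_elements(c, d):
    """Hidden activation of the feedforward step, assumed to be 4d wide."""
    return 4 * c * d

def ratio(c, d):
    """Attention count divided by feedforward count; 1.0 means balanced."""
    return attention_elements(c, d) / feedforward_elements(c, d)

# (name, c, d) as quoted in the text above.
models = [("Llama", 4096, 8192), ("Falcon", 2048, 14848)]
for name, c, d in models:
    print(f"{name}: c={c}, d={d}, attention/feedforward = {ratio(c, d):.3f}")

# With c = d the two counts match, as in the estimate above.
print("c = d:", ratio(8192, 8192))
```

Under this counting the ratio simplifies to (c + 3d) / (4d), so it only reaches 1 when c = d and grows linearly with c beyond that point.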

Why? Probably to make training cheaper. But it might make more sense to train a model with a smaller dimension and a bigger context size.

That won't win you any of the current benchmarks, as they don't check for long context. As long as the questions fit in the context, there's no gain in making it longer. This is a bit of a problem with the current benchmarks if you want to use LLMs for longer-form content (like having one write a book).