this post was submitted on 09 Nov 2023

LocalLLaMA

Community to discuss Llama, the family of large language models created by Meta AI.

Hello everyone.

I have been catching up with the recent literature on long context. I am reading kaiokendev's post and the NTK scaling work. One question that neither their posts nor anything else I've found seems to answer: isn't attention quadratic? I still recall the O(L^2) complexity that is referenced countless times in the older literature (e.g. Longformer or BigBird). (I read somewhere that in practice it is actually the MLP that consumes the most compute, but still, the longer the sequence, the more memory attention takes.)
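
To make the quadratic growth concrete, here is a rough back-of-the-envelope estimate of what materializing the full attention score matrix would cost; the head count and fp16 dtype are just illustrative assumptions, not any specific model:

```python
# Rough memory estimate for naive attention, assuming fp16 scores (2 bytes),
# 32 heads, and a full L x L score matrix per head (illustrative numbers only).

def naive_attention_score_bytes(seq_len, n_heads=32, bytes_per_el=2):
    # One L x L attention matrix per head, materialized at once.
    return n_heads * seq_len * seq_len * bytes_per_el

for L in (2048, 8192):
    gib = naive_attention_score_bytes(L) / 2**30
    print(f"L={L}: ~{gib:.2f} GiB of scores per layer")

# Prints ~0.25 GiB for L=2048 and ~4.00 GiB for L=8192:
# 4x the length costs 16x the memory for the scores alone.
```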

Based on my understanding, the current work on long context tweaks the frequency or base in the rotary embedding, which lets the model interpolate to unseen sequence lengths. But you still need much more memory for an 8k input than for a 2k one. Is this issue solved? I appreciate any pointers. Thanks!
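
For reference, here is a minimal sketch of what the base/frequency tweak looks like, assuming the simple "multiply the base" variant of NTK scaling; the exact scaling rule differs between write-ups, and the head dimension and factor below are illustrative:

```python
import torch

# Minimal sketch of rotary-embedding frequencies with a tweaked base
# ("NTK-aware" scaling). The exact scaling rule varies between write-ups;
# this is just the simple "multiply the base" variant.

def rope_angles(seq_len, head_dim, base=10000.0, ntk_factor=1.0):
    # Raising the base stretches the longest wavelengths, which is what lets
    # the model interpolate to positions it never saw during training.
    scaled_base = base * ntk_factor
    inv_freq = 1.0 / (scaled_base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    positions = torch.arange(seq_len, dtype=torch.float32)
    return torch.outer(positions, inv_freq)  # (seq_len, head_dim // 2) table of angles

angles_2k = rope_angles(2048, 128)                  # original training range
angles_8k = rope_angles(8192, 128, ntk_factor=4.0)  # 4x context, base scaled up
```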

[–] Aaaaaaaaaeeeee@alien.top 1 points 1 year ago

Yes, but most people use flash attention: https://hazyresearch.stanford.edu/blog/2023-07-17-flash2, which makes memory requirements linear in sequence length for inference.
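
For example, in PyTorch the fused scaled_dot_product_attention can dispatch to a flash-attention style kernel on supported GPUs, so the full L x L score matrix is never materialized; the shapes below are just illustrative:

```python
import torch
import torch.nn.functional as F

# Sketch: F.scaled_dot_product_attention may dispatch to a flash-attention
# style kernel on supported GPUs, avoiding the L x L score matrix.
# Batch size, head count, and head dim here are illustrative assumptions.

q = torch.randn(1, 32, 8192, 128, dtype=torch.float16, device="cuda")
k = torch.randn(1, 32, 8192, 128, dtype=torch.float16, device="cuda")
v = torch.randn(1, 32, 8192, 128, dtype=torch.float16, device="cuda")

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # (1, 32, 8192, 128)
```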

For CPU inference, it benefits the most from switching to a new architecture that does not require that, since the CPU has much less compute available for flash attention. Currently, I don't think ggml uses flash attention.