this post was submitted on 30 Nov 2023

LocalLLaMA

Community to discuss about Llama, the family of large language models created by Meta AI.

Has anyone already read this new article on ArXiv? https://arxiv.org/abs/2311.10770

It looks very promising: a potential inference speedup of about 30x with a PyTorch implementation, about 117x when implemented in native CUDA, and an estimated maximum speedup of 341x.

As far as I understand, this is achieved by replacing the traditional feedforward layers with so-called fast feedforward layers.

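For anyone who hasn't read the paper yet, here is a rough sketch of how I read the idea (this is not the authors' implementation; the class name, tree depth, and GELU choice are just illustrative): the layer's neurons are arranged as a balanced binary tree, and at inference time each input only evaluates the neurons along one root-to-leaf path, with the sign of each neuron's pre-activation choosing the next child.

```python
import torch
import torch.nn as nn

class FastFeedForwardSketch(nn.Module):
    """Rough sketch of a fast feedforward (FFF) layer at inference time.

    Neurons are arranged as a balanced binary tree of depth `depth`.
    Each input only visits the `depth` neurons along one root-to-leaf
    path, so only O(log2(width)) of the layer's neurons are evaluated
    instead of all of them. Illustrative only, not the paper's code.
    """

    def __init__(self, d_model: int, depth: int = 11):
        super().__init__()
        self.depth = depth
        n_nodes = 2 ** depth - 1  # total neurons in the tree
        self.w_in = nn.Parameter(torch.randn(n_nodes, d_model) * d_model ** -0.5)
        self.w_out = nn.Parameter(torch.randn(n_nodes, d_model) * d_model ** -0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model); hard routing, one root-to-leaf path per sample
        batch = x.shape[0]
        node = torch.zeros(batch, dtype=torch.long, device=x.device)  # start at root
        y = torch.zeros_like(x)
        for _ in range(self.depth):
            # pre-activation of the currently selected neuron for each sample
            logit = (x * self.w_in[node]).sum(dim=-1, keepdim=True)
            # only the visited neurons contribute to the output
            y = y + torch.nn.functional.gelu(logit) * self.w_out[node]
            # the sign of the pre-activation picks the left or right child
            go_right = (logit.squeeze(-1) > 0).long()
            node = 2 * node + 1 + go_right  # unused after the final level
        return y
```

With depth 11 that is roughly 2000 neurons per layer, of which each token only touches 11; as I understand it, that is where the claimed speedups come from.
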
Is there anyone here with real experience contributing to the development of PyTorch or llama.cpp, or releasing open models? What do you say to this?

top 5 comments
[–] BalorNG@alien.top 1 points 1 year ago

I say:

  1. It takes a performance hit, but it remains to be seen whether going to a much larger model can compensate for that.
  2. The model needs to be trained from scratch; apparently you cannot finetune an existing model to use this...
[–] Wonderful_Ad_5134@alien.top 1 points 1 year ago (1 children)

" we provide high-level CPU code achieving 78x speedup over the optimized baseline feedforward implementation"

Big if true. We wouldn't need to buy 3090 cards anymore to get sufficient memory; just buying more RAM would suffice.

[–] pmp22@alien.top 1 points 1 year ago

Huge, if true.

[–] luxsteele@alien.top 1 points 1 year ago
[–] yoomiii@alien.top 1 points 1 year ago