If it works that way, it will only be short term. The only reason it doesn't run well on a GPU is the conditional matrix ops, so the GPU makers will just add support for them. Then they'll be back on top with the same margins again.
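For anyone wondering what "conditional matrix ops" means here, a rough sketch (my own toy version, not the paper's actual code, and the routing rule is made up) of a conditional feedforward that only touches one slice of the weights, picked by data-dependent branches:

```python
import numpy as np

def dense_ff(x, W):
    # Ordinary feedforward: every input multiplies the full weight matrix.
    # This is exactly what GPU matmul kernels are optimized for.
    return x @ W

def conditional_ff(x, W, depth=3):
    # Sketch of a conditional matrix op: a few cheap comparisons pick ONE
    # column block of W per input, so most of W is never touched.
    # The data-dependent branching is what stock GPU kernels don't do.
    n_blocks = 2 ** depth
    block = W.shape[1] // n_blocks
    idx = 0
    for d in range(depth):
        # Illustrative routing rule: branch on the sign of one feature.
        idx = 2 * idx + (1 if x[d] > 0 else 0)
    cols = slice(idx * block, (idx + 1) * block)
    return x @ W[:, cols]          # only 1/n_blocks of the work

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
W = rng.standard_normal((64, 256))
print(dense_ff(x, W).shape)        # (256,)
print(conditional_ff(x, W).shape)  # (32,) - one selected block
```

Per token it does a fraction of the math, but the branches make the memory access pattern input-dependent, which is the part current GPU matmul hardware isn't built for.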
Also, they say the speedup decreases with more layers, so the bigger the model, the smaller the benefit. A 512B model is much bigger than a 7B model, so the speedup will be much less, possibly none at all.
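The shrinking benefit is basically Amdahl's law: if the trick only speeds up part of each forward pass, the overall gain is capped by everything it doesn't touch. A back-of-envelope sketch with made-up numbers:

```python
def overall_speedup(accel_fraction, local_speedup):
    # Amdahl's law: only the accelerated fraction of the compute benefits;
    # the rest (attention, norms, etc.) still runs at the old speed.
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / local_speedup)

# Hypothetical numbers just to show the shape of the effect: even if the
# conditional layers themselves were 78x faster, the end-to-end gain
# collapses as their share of total compute drops.
for frac in (0.9, 0.6, 0.3):
    print(f"{frac:.0%} of compute accelerated 78x -> "
          f"{overall_speedup(frac, 78):.1f}x overall")
# 90% -> ~9.0x, 60% -> ~2.5x, 30% -> ~1.4x
```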