LocalLLaMA

4 readers

4 users here now

Community to discuss about Llama, the family of large language models created by Meta AI.

founded 2 years ago

MODERATORS

communick@poweruser.forum

40x or more speedup by selecting important neurons (alien.top)

submitted 2 years ago by koehr@alien.top to c/localllama@poweruser.forum

15 comments fedilink hide all child comments

https://arxiv.org/abs/2311.10770

"UltraFastBERT", apparently a variant of BERT, that uses only 0.3% of it's neurons during inference, is performing on par with similar BERT models.

I hope that's going to be available for all kinds of models in the near future!

you are viewing a single comment's thread
view the rest of the comments

[–] ReturningTarzan@alien.top 1 points 2 years ago

To add to that: GPUs do support "conditional" matrix multiplication, they just don't benefit from that type of optimization. Essentially, it takes as much time to skip a computation as it does to perform it. And in practice it can even take longer since the extra logic required to keep track of which computations to skip will add overhead.

In order for this to make sense on a GPU you need a way of completely sidestepping portions of the model, like the ability to skip whole layers that are not relevant (a bit how MoE works already). If you have to load a weight from memory, or some sort of metadata to figure out what each individual weight is connected to, you've already allocated as many resources to that weight as you would if you simply used it in a streamlined matrix multiplication.

The same also holds to a lesser extent for efficient CPU implementations that also rely on SIMD computations, regular memory layouts and predictable control flows.