LocalLLaMA

this post was submitted on 22 Nov 2023
https://arxiv.org/abs/2311.10770

"UltraFastBERT", apparently a variant of BERT, that uses only 0.3% of it's neurons during inference, is performing on par with similar BERT models.

I hope that's going to be available for all kinds of models in the near future!
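
For anyone curious what "0.3% of its neurons" means in practice: the paper replaces the dense feedforward layers with fast feedforward networks (FFFs) that route each token down a binary tree, so only the neurons along one root-to-leaf path are ever evaluated. Below is a rough sketch of that idea; the shapes, the hard sign-based routing rule, and the placeholder activation are my own simplifications, not the paper's exact formulation.

```python
import numpy as np

def fff_forward(x, w_in, w_out, depth):
    """Toy fast-feedforward inference for a single token.

    x     : (d_model,)          input activation
    w_in  : (n_nodes, d_model)  per-neuron input weights
    w_out : (n_nodes, d_model)  per-neuron output weights
    depth : tree levels traversed = neurons actually used
    """
    y = np.zeros_like(x)
    node = 0                                       # start at the root
    for _ in range(depth):
        logit = w_in[node] @ x                     # one dot product per level
        y += max(logit, 0.0) * w_out[node]         # placeholder activation
        node = 2 * node + (1 if logit > 0 else 2)  # descend left or right
    return y

# A depth-12 path through a tree of 2**12 - 1 = 4095 neurons touches
# only 12 of them per token -- about 0.3%, matching the headline number.
d_model, depth = 16, 12
n_nodes = 2**depth - 1
rng = np.random.default_rng(0)
w_in = rng.standard_normal((n_nodes, d_model)) / np.sqrt(d_model)
w_out = rng.standard_normal((n_nodes, d_model)) / np.sqrt(d_model)
print(fff_forward(rng.standard_normal(d_model), w_in, w_out, depth))
```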

[–] MoffKalast@alien.top 1 points 11 months ago (3 children)

Would be interesting to see if this can help speed up CPU inference with regular RAM; after all, 128 GB of DDR5 only costs like $300, which is peanuts compared to getting anywhere close to that much VRAM.

If it scales linearly then one could run a 100B model at the speed of a 3B one right now.

[–] OnurCetinkaya@alien.top 1 points 11 months ago (2 children)

I am just gonna do some bad maths.

For the price of a single 4090 you can get:

  • CPU + mainboard combo with 16 RAM slots: $1,320

  • 16 x 32 GB DDR4 RAM: $888 (512 GB total)

Mistral 7B runs at around 7 tokens per second on a regular CPU, which is like 5 words per second.

In the above setup's 512 GB of RAM we could fit a 512B-parameter model. With the current architecture that would run at about 5 × 7 / 512 ≈ 0.068 words per second; if this new architecture actually works and gives a 78x speedup, that becomes about 5.3 words per second. The average person's reading speed is around 4 words per second, and the average person's speaking speed is around 2 words per second.
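
Redoing that arithmetic in a few lines of Python, mirroring the comment's assumptions (7 tok/s for a 7B model on CPU, ~5 words per 7 tokens, and the 78x CPU speedup cited above) rather than any measurements:

```python
# Back-of-envelope only; every number below is an assumption from this thread.
tok_per_s_7b  = 7        # Mistral 7B on a regular CPU (claimed above)
words_per_tok = 5 / 7    # "7 tokens per second ... like 5 words per second"
model_size_b  = 512      # hypothetical model filling the 512 GB of RAM
cpu_speedup   = 78       # CPU speedup figure cited above

dense_words_per_s = tok_per_s_7b * words_per_tok * (7 / model_size_b)
fff_words_per_s   = dense_words_per_s * cpu_speedup

print(f"dense 512B model : {dense_words_per_s:.3f} words/s")  # ~0.068
print(f"with 78x speedup : {fff_words_per_s:.1f} words/s")    # ~5.3
# For reference: reading ~4 words/s, speaking ~2 words/s, per the comment above.
```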

Fingers crossed this can put a small dent in Nvidia's stock price.

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago

> Fingers crossed this can put a small dent in Nvidia's stock price.

If it works that way, it will only be short term, since the only thing holding this back on GPUs is the conditional matrix ops. The GPU makers will just add support for them, and then they'll be back on top with the same margins again.
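
For context, the conditional matrix ops here are what the paper calls conditional matrix multiplication (CMM): instead of one big dense matmul, each token gathers only the handful of weight rows its tree path selects, and which rows those are depends on the data. A toy illustration of the difference (the sizes and the selected path are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_neurons = 768, 4095          # 4095 = neurons of a depth-12 tree
x = rng.standard_normal(d_model)
W = rng.standard_normal((n_neurons, d_model))

# Dense feedforward: one big matmul over all 4095 rows -- exactly the kind
# of regular, batched work current GPUs are optimized for.
dense_out = W @ x

# Conditional version: only the ~12 rows on one root-to-leaf path are used,
# and the path (hence the memory access pattern) depends on the input itself.
path = np.array([0, 2, 6, 13, 28, 57, 116, 233, 468, 937, 1876, 3754])  # made-up path
cond_out = W[path] @ x                  # data-dependent gather + tiny matmul
```

The arithmetic drops by roughly 4095 / 12 ≈ 340x, but the irregular, data-dependent gather is exactly what dense-matmul hardware and libraries don't accelerate well today, which is the gap the commenter expects GPU vendors to close.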

Also, they say the speedup decreases with more layers, so the bigger the model, the less the benefit. A 512B model is much bigger than a 7B model, so the speedup would be much smaller, possibly none.

[–] MoffKalast@alien.top 1 points 11 months ago

I doubt it; most of their leverage comes from being the only supplier of the hardware required for pretraining foundation models, and this doesn't really change that.