Basically GPT-4 Turbo
LocalLLaMA
GPT-4 turbo only speeds things up by 3x…
Would be interesting to see if this can help speed up CPU inference with regular RAM; after all, 128 GB of DDR5 only costs around $300, which is peanuts compared to trying to get anywhere close to that much VRAM.
If it scales linearly then one could run a 100B model at the speed of a 3B one right now.
I am just gonna do some bad maths.
For the price of a single 4090 you can get:
- CPU + motherboard combo with 16 RAM slots: $1,320
- 512 GB RAM total
Mistral 7B runs around 7 tokens per second on a regular CPU, that is like 5 words per second.
With the 512 GB of RAM in the setup above we could fit a 512B parameter model. Assuming speed scales inversely with parameter count, that would run at 5 × 7 / 512 ≈ 0.068 words per second with the current architecture. If this new architecture actually works and gives a 78x speedup, that becomes about 5.3 words per second. The average person's reading speed is around 4 words per second, and the average speaking speed is around 2 words per second.
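The back-of-envelope numbers above can be checked in a few lines. Note the inputs (5 words/s for a 7B model, inverse-linear scaling with parameter count, the paper's 78x figure) are the comment's assumptions, not measurements:

```python
# Back-of-envelope check of the numbers above (assumptions, not measurements)
wps_7b = 5.0                    # Mistral 7B on CPU: ~7 tok/s ≈ 5 words/s (claimed above)
params_7b, params_512b = 7, 512 # model sizes in billions of parameters

# Naive assumption: speed scales inversely with parameter count
wps_512b = wps_7b * params_7b / params_512b
speedup = 78                    # the paper's reported CPU speedup

print(round(wps_512b, 3))            # 0.068 words/s without the speedup
print(round(wps_512b * speedup, 1))  # 5.3 words/s with it
```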
Fingers crossed this can put a small dent on Nvidia's stock price.
If it works that way, it will only be short term. The only reason it doesn't run on a GPU is the lack of support for conditional matrix ops, so the GPU makers will just add them. Then they'll be back on top with the same margins again.
Also, they say the speedup decreases with more layers. So the bigger the model, the less the benefit. A 512B model is much bigger than a 7B model thus the speedup will be much less. Possibly none.
I doubt it, most of their leverage is in being the only suppliers of hardware required for pretraining foundational models. This doesn't really change that.
Here are my notes:
Overview:
As proof, we present UltraFastBERT, a BERT variant that uses 0.3% of its neurons during inference while performing on par with similar BERT models. UltraFastBERT selectively engages just 12 out of 4095 neurons for each layer inference. This is achieved by replacing feedforward networks with fast feedforward networks (FFFs). While no truly efficient implementation currently exists to unlock the full acceleration potential of conditional neural execution, we provide high-level CPU code achieving 78x speedup over the optimized baseline feedforward implementation, and a PyTorch implementation delivering 40x speedup over the equivalent batched feedforward inference.
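The "12 out of 4095" ratio follows from the tree structure: a full binary tree with 12 levels has 2^12 − 1 = 4095 nodes, and a single root-to-leaf descent touches exactly 12 of them. Here is a minimal sketch of that conditional traversal, assuming (as in the paper's FFF scheme) that every tree node is a neuron and the sign of its activation picks the child; the weight names and shapes here are illustrative, not the paper's actual parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, d_model = 12, 64          # 12 tree levels -> 2**12 - 1 = 4095 neurons
n_nodes = 2**depth - 1
w_in = rng.standard_normal((n_nodes, d_model))   # per-neuron input weights
w_out = rng.standard_normal((n_nodes, d_model))  # per-neuron output weights

def fff_forward(x):
    """Descend one root-to-leaf path; only the visited neurons compute."""
    y = np.zeros_like(x)
    node, used = 0, 0
    while node < n_nodes:
        act = w_in[node] @ x                     # this neuron's pre-activation
        y += max(act, 0.0) * w_out[node]         # ReLU neuron adds its output
        node = 2 * node + (1 if act > 0 else 2)  # activation sign picks the child
        used += 1
    return y, used

_, neurons_used = fff_forward(rng.standard_normal(d_model))
print(neurons_used, "of", n_nodes)  # 12 of 4095
```

The point of the sketch is the control flow: each input takes a different path through the tree, which is exactly the data-dependent branching that dense matmul hardware dislikes.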
Benchmarks Averages
- base implementation with no neurons ignored: 79.9
- ~60% of neurons ignored: 79.2
- ~95% of neurons ignored: 78.1
- ~99.7% of neurons ignored: 77.3
Benchmarks that don't degrade at all as more neurons are ignored:
- RTE ("Recognizing Textual Entailment", determining whether a statement can be inferred from a given text)
- MRPC (ability to measure semantic similarity)
- STSB (ability to measure semantic similarity)
Benchmarks that degrade:
- SST-2: (sentiment analysis)
- MNLI (determining whether a given statement is true, false, or unknown provided context)
- QNLI (determine whether a sentence has an answer to a question)
- QQP (determine whether one question is a paraphrase of another)
Benchmarks that degrade substantially:
CoLA, which is addressed in the paper:
Note, however, that the majority of the performance decrease due to the increasing depth is caused by only a single task – CoLA. This behaviour has previously been observed in the literature and is in line with other work trying to compress BERT behaviour into smaller models ... If we disregard CoLA, at least 98.6% of the predictive performance is preserved by all UltraFastBERT models.
Corpus of Linguistic Acceptability (CoLA): sentences annotated as grammatically acceptable or not by experts.
Applicability to CausalLM such as Llama 2
We also observe that the performance decreases with the increasing depth of the FFFs.
With substantially more FF layers in Llama 2, this is concerning. Additionally, it's not obvious to me that this works with a 7B to 70B parameter causal language model just because it works with a ~100M parameter bidirectional encoder. Would be great to see it tested, however!
Other
- Only works on CPU due to GPUs not supporting "conditional matrix multiplication"
To add to that: GPUs do support "conditional" matrix multiplication, they just don't benefit from that type of optimization. Essentially, it takes as much time to skip a computation as it does to perform it. And in practice it can even take longer since the extra logic required to keep track of which computations to skip will add overhead.
In order for this to make sense on a GPU you need a way of completely sidestepping portions of the model, like the ability to skip whole layers that are not relevant (a bit how MoE works already). If you have to load a weight from memory, or some sort of metadata to figure out what each individual weight is connected to, you've already allocated as many resources to that weight as you would if you simply used it in a streamlined matrix multiplication.
The same also holds to a lesser extent for efficient CPU implementations that also rely on SIMD computations, regular memory layouts and predictable control flows.
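The masking point above can be made concrete. In a SIMD-friendly "conditional" matmul, the selection happens *after* every weight has already been loaded and multiplied, so the dense and masked versions do the same number of multiply-accumulates (toy sizes, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 8))   # 64 weights
x = rng.standard_normal(8)
mask = rng.random(8) < 0.1        # pretend only ~10% of neurons are "active"

dense = W @ x                     # all 64 multiply-accumulates
masked = (W @ x) * mask           # still all 64 MACs, then a throwaway select

print(np.count_nonzero(mask), "active rows, but both paths did", W.size, "MACs")
```

To actually save work, the inactive rows of `W` must never be read at all, which is what the FFF tree descent does and what regular vectorized kernels cannot express.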
RemindMe! 15 days "40x BERT"
CPU speedups... So... Macs are back in the game for local LLMs?
Future is going to be interesting. With this kind of CPU speedup we can run blazing fast LLMs on a toaster if it has enough RAM.
Does this technique affect the required RAM-size for inference?
I don't think so (unfortunately). The model size doesn't change, only the way it is traversed.
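Right: the technique changes which neurons run, not how many are stored, so weight memory stays at roughly parameters × bytes per parameter. A rough estimate (assuming fp16 weights and ignoring KV cache and activations):

```python
def model_ram_gb(params_billion, bytes_per_param=2):
    """Approximate weight memory in GiB, assuming fp16 (2 bytes/param)."""
    return params_billion * 1e9 * bytes_per_param / 2**30

print(round(model_ram_gb(7), 1))  # ~13.0 GiB for a 7B fp16 model
```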
Can this technique be combined with LoRA at a not-so-low rank? I've heard LoRA increases training time, but that should no longer be a problem then :)
I wonder if you can pass a large dataset of prompts to perform a certain relatively narrow task and see which neurons get activated. And then use statistical measures to add a few surrounding neurons just in case.
Bet you could get away with near-zero quality loss and massive parameter compression.