LocalLLaMA

40x or more speedup by selecting important neurons (alien.top)

submitted 2 years ago by koehr@alien.top to c/localllama@poweruser.forum

https://arxiv.org/abs/2311.10770

"UltraFastBERT", apparently a variant of BERT that uses only 0.3% of its neurons during inference, performs on par with similar BERT models.

I hope that's going to be available for all kinds of models in the near future!

[–] andrewlapp@alien.top 1 points 2 years ago (1 children)

Here are my notes:

Overview:

"As proof, we present UltraFastBERT, a BERT variant that uses 0.3% of its neurons during inference while performing on par with similar BERT models. UltraFastBERT selectively engages just 12 out of 4095 neurons for each layer inference. This is achieved by replacing feedforward networks with fast feedforward networks (FFFs). While no truly efficient implementation currently exists to unlock the full acceleration potential of conditional neural execution, we provide high-level CPU code achieving 78x speedup over the optimized baseline feedforward implementation, and a PyTorch implementation delivering 40x speedup over the equivalent batched feedforward inference."
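To make the "12 out of 4095" figure concrete, here is a minimal pure-Python sketch of how a fast feedforward layer could pick its neurons. It assumes the neurons form a balanced binary tree of depth 11 (2^12 - 1 = 4095 nodes) and that each visited neuron's pre-activation sign decides which child is visited next. The function names, the heap-style node indexing, the left/right convention and the use of GELU are my assumptions for illustration, not the authors' code.

```python
import math
import random

def gelu(v):
    # Exact GELU via the error function.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def fff_forward(x, w_in, w_out, depth):
    """Single-token inference through a tree of neurons (hypothetical sketch).

    w_in[n] and w_out[n] are the input and output weight vectors of neuron n.
    Nodes are stored in heap order: the children of n are 2n+1 and 2n+2.
    Only depth + 1 neurons (one root-to-leaf path) are ever evaluated.
    """
    out = [0.0] * len(w_out[0])
    path = []
    node = 0
    for _ in range(depth + 1):
        path.append(node)
        pre = sum(xi * wi for xi, wi in zip(x, w_in[node]))
        act = gelu(pre)
        for j, wj in enumerate(w_out[node]):
            out[j] += act * wj
        # Assumed convention: positive pre-activation goes right, else left.
        node = 2 * node + 1 + (1 if pre > 0 else 0)
    return out, path

random.seed(0)
dim, depth = 8, 11
n_nodes = 2 ** (depth + 1) - 1  # 4095 neurons in the tree
w_in = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_nodes)]
w_out = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_nodes)]
x = [random.gauss(0, 1) for _ in range(dim)]

out, path = fff_forward(x, w_in, w_out, depth)
used_percent = 100 * len(path) / n_nodes
print(len(path), n_nodes, round(used_percent, 2))
```

Under these assumptions each token touches 12 of the 4095 neurons, roughly 0.29%, which matches the 0.3% quoted above. Note that the neuron chosen at each step depends on the input, which is what makes the computation "conditional".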

Benchmarks Averages

base implementation with no neurons ignored: 79.9
~60% of neurons ignored: 79.2
~95% of neurons ignored: 78.1
~99.7% of neurons ignored: 77.3
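As a quick sanity check on how much is lost, the averages above can be turned into "percent of baseline retained". This is just arithmetic on the four numbers listed; the dictionary keys are my own labels.

```python
baseline = 79.9
averages = {
    "~60% ignored": 79.2,
    "~95% ignored": 78.1,
    "~99.7% ignored": 77.3,
}

# Percent of the baseline average score that each variant keeps.
retained = {name: round(100 * score / baseline, 1) for name, score in averages.items()}

for name, pct in retained.items():
    print(f"{name}: {pct}% of baseline")
```

So even the most aggressive variant keeps roughly 97% of the baseline average.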

Benchmarks that don't degrade at all as more neurons are ignored:

RTE ("Recognizing Textual Entailment", determining whether a statement can be inferred from a given text)
MRPC (deciding whether two sentences are paraphrases of each other)
STSB (scoring the semantic similarity of two sentences)

Benchmarks that degrade:

SST-2 (sentiment analysis)
MNLI (determining whether a given statement is true, false, or unknown provided context)
QNLI (determine whether a sentence has an answer to a question)
QQP (determine whether one question is a paraphrase of another)

Benchmarks that degrade substantially:

CoLA, which is addressed in the paper:

"Note, however, that the majority of the performance decrease due to the increasing depth is caused by only a single task – CoLA. This behaviour has previously been observed in the literature and is in line with other work trying to compress BERT behaviour into smaller models ... If we disregard CoLA, at least 98.6% of the predictive performance is preserved by all UltraFastBERT models."

Corpus of Linguistic Acceptability (CoLA): sentences annotated as grammatically acceptable or not by experts.

Applicability to CausalLM such as Llama 2

"We also observe that the performance decreases with the increasing depth of the FFFs."

With substantially more FF layers in Llama 2, this is concerning. Additionally, it's not obvious to me that this works with a 7B to 70B parameter causal language model just because it works with a ~100M parameter bidirectional encoder. Would be great to see it tested, however!

Other

Only works on CPU due to GPUs not supporting "conditional matrix multiplication"

[–] ReturningTarzan@alien.top 1 points 2 years ago

To add to that: GPUs do support "conditional" matrix multiplication, they just don't benefit from that type of optimization. Essentially, it takes as much time to skip a computation as it does to perform it. And in practice it can even take longer since the extra logic required to keep track of which computations to skip will add overhead.

In order for this to make sense on a GPU you need a way of completely sidestepping portions of the model, like the ability to skip whole layers that are not relevant (a bit like how MoE already works). If you have to load a weight from memory, or some sort of metadata to figure out what each individual weight is connected to, you've already allocated as many resources to that weight as you would if you simply used it in a streamlined matrix multiplication.

The same also holds to a lesser extent for efficient CPU implementations that also rely on SIMD computations, regular memory layouts and predictable control flows.
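For illustration, here is a rough pure-Python sketch of the coarse-grained kind of skipping described above: a single routing decision selects one whole block of weights, and the selected block is then multiplied densely. Top-1 routing, the function names and the toy sizes are all assumptions of this sketch; it does not correspond to any particular MoE implementation.

```python
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def route_top1(x, router):
    # One score per block; pick the highest (assumed top-1 routing).
    scores = [dot(x, r) for r in router]
    return max(range(len(scores)), key=scores.__getitem__)

def block_sparse_forward(x, router, experts):
    # A single decision skips every other block entirely; the chosen
    # block is then an ordinary dense matrix-vector product.
    k = route_top1(x, router)
    y = [dot(x, row) for row in experts[k]]
    return k, y

random.seed(0)
dim, hidden, n_experts = 8, 16, 4
router = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
experts = [
    [[random.gauss(0, 1) for _ in range(dim)] for _ in range(hidden)]
    for _ in range(n_experts)
]
x = [random.gauss(0, 1) for _ in range(dim)]

chosen, y = block_sparse_forward(x, router, experts)
weights_touched = n_experts * dim + hidden * dim             # router + one block
weights_total = n_experts * dim + n_experts * hidden * dim   # router + all blocks
print(chosen, weights_touched, weights_total)
```

The point of the sketch is the granularity: the skip decision costs one small router product, and everything after it is a regular, contiguous matrix multiplication, with no per-weight bookkeeping.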