Great work! Would you mind sharing the datasets you used and/or how you augmented the data for training?
VRAM scales quadratically as sequence length increases, and I'm not aware of any solutions. Even efficient long-context fine-tuning methods such as LongLoRA only improve speed and quality; memory usage stays the same as standard LoRA.
I recommend ensuring you're reducing memory in other ways (see the sketch after this list):
- Ensure you're using 4-bit QLoRA.
- Ensure batch size is 1.
- Ensure you're using FlashAttention-2.
- Ensure your optimizer state is in CPU memory by using a paged optimizer.
- Use gradient checkpointing.
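As a concrete starting point, here is a minimal sketch that combines all of the above with transformers, peft, and bitsandbytes; the model name and hyperparameters are placeholders, not recommendations:

```python
# Illustrative only: 4-bit QLoRA + FlashAttention-2 + paged optimizer + gradient checkpointing.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",                      # placeholder base model
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,                           # 4-bit QLoRA
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    attn_implementation="flash_attention_2",         # FlashAttention-2
    torch_dtype=torch.bfloat16,
)
model = prepare_model_for_kbit_training(model)       # prep quantized model for training
model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32))

training_args = TrainingArguments(
    output_dir="qlora-out",
    per_device_train_batch_size=1,                   # batch size 1
    gradient_accumulation_steps=16,                  # recover a useful effective batch size
    gradient_checkpointing=True,                     # trade compute for activation memory
    optim="paged_adamw_8bit",                        # paged optimizer state
    bf16=True,
)
```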
You could also do something more experimental, like employing Mistral with a sliding window of 1024 tokens to capture 2048 tokens of context while only using the memory of 1024 tokens.
Or you could just summarize or prune your long examples.
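For the sliding-window idea, here is a minimal sketch assuming the Hugging Face Mistral implementation; 1024 is just the window size from the suggestion above, and whether the memory savings materialize depends on the attention backend:

```python
# Illustrative only: shrink Mistral's sliding attention window so each token attends
# to at most the previous 1024 tokens. The window is enforced efficiently in the
# FlashAttention-2 path.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
config.sliding_window = 1024                      # the released config defaults to 4096

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    config=config,
    attn_implementation="flash_attention_2",
)
```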
Say I've got a paragraph about something and the text block contains some other unrelated comment
Have you considered creating text embeddings, calculating their distance matrix, and applying pagerank?
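A minimal sketch of that approach (assuming sentence-transformers and networkx; the embedding model and similarity weighting are placeholder choices):

```python
# Illustrative only: embed each text block, build a similarity graph, and use
# PageRank centrality to spot blocks that are unrelated to the rest.
import networkx as nx
import numpy as np
from sentence_transformers import SentenceTransformer

blocks = [
    "First paragraph about the main topic ...",
    "Second paragraph continuing the main topic ...",
    "An unrelated comment about something else entirely.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")          # placeholder embedding model
emb = model.encode(blocks, normalize_embeddings=True)
sim = np.clip(emb @ emb.T, 0.0, None)                    # cosine similarity, negatives clipped
np.fill_diagonal(sim, 0.0)                               # ignore self-similarity

graph = nx.from_numpy_array(sim)                         # weighted graph over blocks
scores = nx.pagerank(graph, weight="weight")

# Blocks with the lowest centrality are the best candidates to prune.
for idx in sorted(scores, key=scores.get):
    print(f"{scores[idx]:.3f}  {blocks[idx]}")
```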
Here are some factors that may help induce repetition:
- Llama 2 7B, Mistral 7B, or Yi 6B variant
- Use a lossy quantization such as Q2_K (2-bit), Q4_0 (4-bit), or GPTQ (4-bit)
- Use a sequence length of at least 1024 tokens, if not 2048
- Use a text corpus with a lot of repetition, e.g. https://github.com/Lyrics/lyrics
Additionally, you should use lm-evaluation-harness to test for any degradation in performance on common benchmarks.
Very interesting idea. If you can create a simple benchmark (really just any prompt applied to a noisy 7B model) and demonstrate a reduction in repetition compared to baseline, this method will proliferate across the open LLM development ecosystem.
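As a starting point, a crude repetition score of my own devising (not an established metric) could be the fraction of duplicated n-grams per completion, compared between the treated model and the baseline on the same prompts:

```python
# Illustrative only: fraction of duplicated 4-grams in a completion. A highly
# repetitive generation scores close to 1, a non-repetitive one close to 0.
def repeated_ngram_fraction(text: str, n: int = 4) -> float:
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

print(repeated_ngram_fraction("the cat sat on the mat the cat sat on the mat"))  # ≈ 0.33
```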
Looking forward to seeing your implementation!
I agree, and the Open LLM Leaderboard's newly added DROP metric is a step in the right direction.
34B Model Memory Requirements (inference)
Sequence Length (tokens) vs. Bit Precision (bits)
Seq Len | 4-bit  | 6-bit  | 8-bit  | 16-bit
-----------------------------------------------------------
512 | 15.9GB | 23.8GB | 31.8GB | 63.6GB
1024 | 16.0GB | 23.9GB | 31.9GB | 63.8GB
2048 | 16.1GB | 24.1GB | 32.2GB | 64.3GB
4096 | 16.3GB | 24.5GB | 32.7GB | 65.3GB
8192 | 16.8GB | 25.2GB | 33.7GB | 67.3GB
16384 | 17.8GB | 26.7GB | 35.7GB | 71.3GB
32768 | 19.8GB | 29.7GB | 39.7GB | 79.3GB
65536 | 23.8GB | 35.7GB | 47.7GB | 95.3GB
131072 | 31.8GB | 47.7GB | 63.7GB | 127.3GB
262144 | 47.8GB | 71.7GB | 95.7GB | 191.3GB
- It depends on your inference engine. It will probably be much higher in TGI or vLLM than in what you're presumably using, Transformers. You also need to measure input and output token rates separately. Additionally, longer contexts take more time than shorter ones.
- No, it's mostly bound by memory bandwidth (see the rough arithmetic after this list).
- Your fine-tuned model, assuming it uses the same output format (fp16, GGUF, AWQ, etc.) as the base model, will have the same inference speed as the base model.
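To put a rough number on the memory-bandwidth point, here is a back-of-the-envelope sketch with assumed values (a 34B model at 4-bit precision and ~1 TB/s of GPU memory bandwidth):

```python
# Illustrative only: at batch size 1, every generated token requires streaming the
# full set of weights through the GPU once, so bandwidth sets an upper bound on speed.
params = 34e9
bytes_per_param = 0.5                                 # 4-bit quantization
weights_gb = params * bytes_per_param / 1e9           # ≈ 17 GB of weights
bandwidth_gb_per_s = 1000                             # assumed ~1 TB/s (A100-class GPU)
upper_bound_tok_per_s = bandwidth_gb_per_s / weights_gb
print(round(upper_bound_tok_per_s))                   # ≈ 59 tokens/s, ignoring KV cache reads
```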
You might rent a GPU from runpod or another cloud provider.
Memory requirements:
34B Model Memory Requirements (inference)
Sequence Length (tokens) vs. Bit Precision (bits)
Seq Len | 4-bit  | 6-bit  | 8-bit  | 16-bit
-----------------------------------------------------------
512 | 15.9GB | 23.8GB | 31.8GB | 63.6GB
1024 | 16.0GB | 23.9GB | 31.9GB | 63.8GB
2048 | 16.1GB | 24.1GB | 32.2GB | 64.3GB
4096 | 16.3GB | 24.5GB | 32.7GB | 65.3GB
8192 | 16.8GB | 25.2GB | 33.7GB | 67.3GB
16384 | 17.8GB | 26.7GB | 35.7GB | 71.3GB
Here are my notes:
Overview:
As proof, we present UltraFastBERT, a BERT variant that uses 0.3% of its neurons during inference while performing on par with similar BERT models. UltraFastBERT selectively engages just 12 out of 4095 neurons for each layer inference. This is achieved by replacing feedforward networks with fast feedforward networks (FFFs). While no truly efficient implementation currently exists to unlock the full acceleration potential of conditional neural execution, we provide high-level CPU code achieving 78x speedup over the optimized baseline feedforward implementation, and a PyTorch implementation delivering 40x speedup over the equivalent batched feedforward inference.
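Here's my own simplified sketch of the mechanism (not the authors' code): the layer's neurons form a binary tree, each token evaluates only the nodes along one root-to-leaf path, and the sign of each node's pre-activation picks the branch.

```python
# Illustrative only: conditional execution over a binary tree of neurons.
# A depth-12 tree has 2**12 - 1 = 4095 nodes, but each token touches only 12 of them.
import torch

def fff_forward(x, w_in, w_out, depth=12):
    """x: (hidden,), w_in: (n_nodes, hidden), w_out: (n_nodes, hidden)."""
    y = torch.zeros_like(x)
    node = 0
    for _ in range(depth):
        pre = torch.dot(x, w_in[node])                        # this node's pre-activation
        y = y + torch.nn.functional.gelu(pre) * w_out[node]   # accumulate its contribution
        node = 2 * node + (1 if pre.item() > 0 else 2)        # sign picks the child
    return y

hidden, n_nodes = 768, 2**12 - 1
x = torch.randn(hidden)
w_in = torch.randn(n_nodes, hidden)
w_out = torch.randn(n_nodes, hidden)
print(fff_forward(x, w_in, w_out).shape)                      # torch.Size([768])
```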
Benchmark Averages:
- base implementation with no neurons ignored: 79.9
- ~60% of neurons ignored: 79.2
- ~95% of neurons ignored: 78.1
- ~99.7% of neurons ignored: 77.3
Benchmarks that don't degrade at all as more neurons are ignored:
- RTE ("Recognizing Textual Entailment", determining whether a statement can be inferred from a given text)
- MRPC (ability to measure semantic similarity)
- STSB (ability to measure semantic similarity)
Benchmarks that degrade:
- SST-2: (sentiment analysis)
- MNLI (determining whether a given statement is true, false, or unknown given the provided context)
- QNLI (determine whether a sentence has an answer to a question)
- QQP (determine whether one question is a paraphrase of another)
Benchmarks that degrade substantially:
CoLA, which is addressed in the paper:
Note, however, that the majority of the performance decrease due to the increasing depth is caused by only a single task – CoLA. This behaviour has previously been observed in the literature and is in line with other work trying to compress BERT behaviour into smaller models ... If we disregard CoLA, at least 98.6% of the predictive performance is preserved by all UltraFastBERT models.
Corpus of Linguistic Acceptability (CoLA): sentences annotated as grammatically acceptable or not by experts.
Applicability to causal LMs such as Llama 2
We also observe that the performance decreases with the increasing depth of the FFFs.
With substantially more FF layers in Llama 2, this is concerning. Additionally, it's not obvious to me that this works for a 7B to 70B parameter causal language model just because it works for a ~100M parameter bidirectional encoder. It would be great to see it tested, however!
Other
- Only works on CPU due to GPUs not supporting "conditional matrix multiplication"
Great work, this is impressive, especially for a 13B model!