this post was submitted on 31 Oct 2023
LocalLLaMA
Run this with TGI or vLLM
What's the latest t/s on a 4-bit model with TGI? Is there a difference compared with the HF Transformers loader?
The attention layers get replaced with FlashAttention-2, and there's KV caching as well, so you get way better batch-1 and batch-N results. Continuous batching is applied to every request on top of that.
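To give a rough feel for why continuous batching helps at batch N, here is a toy simulation. It is only a sketch under simplifying assumptions, not how TGI or vLLM actually schedule work: every request costs one decode step per output token, prefill and KV cache memory are ignored, and the function names are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each fixed batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled from the queue right away."""
    queue = deque(lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while queue or running:
        # Admit waiting requests into any free slots before the next step.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        # One decode step: every active request emits one token,
        # and requests that are finished drop out of the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


if __name__ == "__main__":
    # Hypothetical output lengths in tokens: two long requests among short ones.
    lengths = [8, 2, 2, 2, 8, 2, 2, 2]
    print("static:", static_batching_steps(lengths, batch_size=4))
    print("continuous:", continuous_batching_steps(lengths, batch_size=4))
```

With these made-up lengths the static version needs 16 decode steps and the continuous version needs 10, because short requests no longer wait for the longest one in their batch. Real servers have more going on (prefill, memory limits, preemption), so treat the numbers as illustrative only.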