I get about 30 t/s on my 12 GB 4070 Ti with Zephyr, so something is definitely borked. 0.8 t/s is what I'd expect from a 70B model running on CPU and system RAM. Make sure you're offloading as many layers to the GPU as your system can handle (in this case, all of them).
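As a rough sanity check on "all of them", here's a hedged back-of-the-envelope estimate of whether a model's weights fit entirely in VRAM. It assumes Zephyr is a ~7B-parameter model at ~4-bit quantization and uses a flat overhead for KV cache and CUDA context; real usage varies with context length, so treat the numbers as illustrative.

```python
# Back-of-the-envelope VRAM check (illustrative assumptions: ~4-bit
# quantized weights plus a flat 2 GB overhead for KV cache / CUDA context).

def estimated_vram_gb(n_params_billion: float, bits_per_weight: float,
                      overhead_gb: float = 2.0) -> float:
    """Estimate VRAM needed to hold the weights plus a flat overhead."""
    weight_gb = n_params_billion * 1e9 * (bits_per_weight / 8) / 1e9
    return weight_gb + overhead_gb

def fits_fully_on_gpu(n_params_billion: float, bits_per_weight: float,
                      vram_gb: float) -> bool:
    return estimated_vram_gb(n_params_billion, bits_per_weight) <= vram_gb

# A 7B model at 4-bit (~3.5 GB of weights) fits easily in 12 GB,
# so every layer can be offloaded.
print(fits_fully_on_gpu(7, 4, 12))    # True
# A 70B model at 4-bit (~35 GB of weights) does not.
print(fits_fully_on_gpu(70, 4, 12))   # False
```

By this estimate the 12 GB card has plenty of headroom for a 7B 4-bit model, which is why 0.8 t/s points to CPU execution rather than a VRAM limit.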
this post was submitted on 31 Oct 2023
LocalLLaMA
Community to discuss Llama, the family of large language models created by Meta AI.
Run this with TGI or vLLM
What's the latest t/s on a 4-bit model with TGI? Is there a difference compared with the HF Transformers loader?
The attention layers get replaced with FlashAttention 2, and there's KV caching as well, so you get much better batch-1 and batch-N results, with continuous batching across every request.
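To illustrate why continuous batching helps: instead of waiting for the whole batch to finish before admitting new work, finished sequences free their slot every decode step and queued requests join the running batch immediately. A toy sketch (the `Request` class and `run` scheduler are hypothetical illustrations, not TGI or vLLM APIs):

```python
# Toy continuous-batching scheduler: one "decode step" emits one token per
# running sequence; slots are refilled from the queue every step.

from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    tokens_left: int  # tokens still to generate

def run(requests, max_batch):
    queue = deque(requests)
    running, steps, completed = [], 0, []
    while queue or running:
        # Admit waiting requests into free slots. This is the key idea:
        # admission happens every step, not once per static batch.
        while queue and len(running) < max_batch:
            running.append(queue.popleft())
        # One decode step: every running sequence emits one token.
        steps += 1
        for r in running:
            r.tokens_left -= 1
        # Retire finished sequences, freeing slots for the next step.
        done = [r for r in running if r.tokens_left == 0]
        completed += [r.rid for r in done]
        running = [r for r in running if r.tokens_left > 0]
    return steps, completed

steps, order = run([Request(0, 2), Request(1, 5), Request(2, 3)], max_batch=2)
print(steps, order)  # 5 [0, 1, 2]
```

With static batching the same workload would take 8 steps (the first batch runs for max(2, 5) = 5 steps, then the third request runs alone for 3); continuous batching finishes in 5 because request 2 takes over request 0's slot mid-flight.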
What is TGI?
Sounds like you are executing that on the CPU. When you run nvidia-smi, do you see memory and GPU utilization?
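A quick way to check is to query nvidia-smi in CSV mode (`nvidia-smi --query-gpu=memory.used,utilization.gpu --format=csv,noheader`) and look at the numbers. The sample lines below are made up for illustration; on a real machine you'd capture the output with `subprocess.run`. The `gpu_in_use` helper is a hypothetical sketch, and the 500 MiB threshold is an arbitrary assumption:

```python
# Parse one CSV line from nvidia-smi, e.g. "8123 MiB, 97 %", and decide
# whether the reported usage suggests model layers are actually on the GPU.

def gpu_in_use(csv_line: str, min_mem_mib: int = 500) -> bool:
    """Return True if memory use or utilization suggests GPU inference."""
    mem_field, util_field = [f.strip() for f in csv_line.split(",")]
    mem_mib = int(mem_field.split()[0])    # e.g. "8123 MiB" -> 8123
    util_pct = int(util_field.split()[0])  # e.g. "97 %" -> 97
    return mem_mib >= min_mem_mib or util_pct > 0

# Illustrative sample lines (not real measurements):
print(gpu_in_use("8123 MiB, 97 %"))  # True  -> model is on the GPU
print(gpu_in_use("15 MiB, 0 %"))     # False -> inference is CPU-only
```

If memory sits near zero and utilization stays at 0 % while generating, the layers never made it onto the card.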