this post was submitted on 23 Nov 2023
1 point (100.0% liked)
LocalLLaMA
Community to discuss about Llama, the family of large language models created by Meta AI.
you are viewing a single comment's thread
What quantization are you using? Smaller quantizations tend to be faster.
I get 30 tokens/s with a q4_0 quantization of 13B models on an M1 Max in Ollama (which uses llama.cpp under the hood). With the same software you should be in the same ballpark, and you aren't going to do much, if any, better than that. The M3's GPU made significant leaps for graphics, but little to nothing for LLMs.
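If you want to check your own tokens/s, here's a minimal sketch against Ollama's local REST API. It assumes an Ollama server is running on the default port and that a q4_0 13B model has already been pulled; the model tag and prompt are just examples, swap in whatever you actually have:

```python
# Rough generation-speed check against a local Ollama server (default port 11434).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2:13b-chat-q4_0",  # example tag; use whichever model you pulled
        "prompt": "Explain quantization in one paragraph.",
        "stream": False,
    },
    timeout=600,
)
data = resp.json()

# eval_count = generated tokens, eval_duration = generation time in nanoseconds.
tokens_per_second = data["eval_count"] / (data["eval_duration"] / 1e9)
print(f"generation speed: {tokens_per_second:.1f} tokens/s")
```

The same response also reports prompt_eval_count/prompt_eval_duration, so you can separate prompt processing speed from generation speed.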
Allowing more threads isn't going to help generation speed, though it might improve prompt processing. It's probably best to keep the number of threads equal to the number of performance cores. A sketch of pinning the thread count is below.
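If you do want to pin the thread count explicitly, Ollama exposes a num_thread option that it passes through to llama.cpp. A minimal sketch, assuming the M1 Max's 8 performance cores and the same example model tag as above:

```python
# Sketch: cap generation threads at the number of performance cores.
# num_thread is an Ollama runtime option forwarded to llama.cpp; 8 matches
# the M1 Max's performance-core count (adjust for your chip).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2:13b-chat-q4_0",  # example tag
        "prompt": "Hello",
        "stream": False,
        "options": {"num_thread": 8},
    },
    timeout=600,
)
print(resp.json()["response"])
```

You can set the same parameter permanently with `PARAMETER num_thread 8` in a Modelfile instead of per request.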