this post was submitted on 27 Nov 2023
1 points (100.0% liked)
LocalLLaMA
Community to discuss Llama, the family of large language models created by Meta AI.
You would have to benchmark batching speed in something like llama.cpp or exllamav2 and then divide it by the number of concurrent users to see what each request gets.
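Back-of-the-envelope, that arithmetic looks something like the sketch below; the throughput and user count are made-up placeholders, not benchmark results:

```python
# Rough sketch: estimate per-request speed from a batched benchmark.
# The numbers here are hypothetical placeholders, not real measurements.

def per_user_speed(batched_tokens_per_sec: float, concurrent_users: int) -> float:
    """Approximate tokens/sec each user sees when requests are batched."""
    return batched_tokens_per_sec / concurrent_users

# e.g. a hypothetical benchmark showing 400 tok/s total at batch size 8
print(per_user_speed(400.0, 8))  # -> 50.0 tok/s per request
```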
There are other backends like MLC, TGI, and vLLM that are better suited to this, but they have much worse quant support.
The "minimum" is one GPU that completely fits the size and quant of the model you are serving.
People serve lots of users through Kobold Horde using only single- and dual-GPU configurations, so this isn't something you'll need tens of thousands for.