this post was submitted on 27 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


Hi all,

Just curious if anybody knows the hardware power required to build a Llama server that can serve multiple users at once.

Any discussion is welcome:)

[–] a_beautiful_rhind@alien.top 1 points 9 months ago

You would have to benchmark batching speed in something like llama.cpp or exllamav2 and then divide it by the number of concurrent users to see what each one gets per request.
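As a rough illustration (the numbers below are made up, not benchmarks), the math is just aggregate batched throughput divided by how many requests are in flight:

```python
# Sketch with assumed numbers: split a measured batched throughput
# evenly across simultaneous requests to get a per-user figure.

def per_user_tokens_per_sec(batched_tokens_per_sec: float, concurrent_users: int) -> float:
    """Naive even split of aggregate batched throughput across users."""
    return batched_tokens_per_sec / concurrent_users

# e.g. if a batch benchmark shows ~600 tok/s aggregate with 8 requests in flight:
print(per_user_tokens_per_sec(600.0, 8))  # -> 75.0 tok/s per user
```

In practice batching isn't perfectly linear, so treat this as a lower-bound estimate and measure the real numbers on your own hardware.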

There are also other backends like MLC, TGI, and vLLM that are better suited to multi-user serving, but they have much worse quant support.
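For a sense of what that looks like, here is a minimal vLLM sketch, assuming vLLM is installed and using a placeholder model id (the continuous-batching scheduler handles the concurrent requests for you):

```python
# Minimal vLLM batched-generation sketch (model id is a placeholder).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")  # assumed model, pick your own
params = SamplingParams(temperature=0.8, max_tokens=128)

# Multiple prompts stand in for multiple users; vLLM batches them internally.
prompts = [
    "Explain KV caching in one sentence.",
    "Write a haiku about GPUs.",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```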

The "minimum" is one GPU that completely fits the size and quant of the model you are serving.

People serve lots of users through the Kobold Horde using only single- and dual-GPU setups, so this isn't something you'll need tens of thousands of dollars for.