this post was submitted on 27 Nov 2023
1 points (100.0% liked)
LocalLLaMA
Community to discuss Llama, the family of large language models created by Meta AI.
You would have to benchmark batching speed in something like llama.cpp or exllamav2 and then divide it by the number of concurrent users to see what each request gets.
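Back-of-the-envelope, that arithmetic looks something like the sketch below; the throughput and user count are made-up placeholders, not benchmark results:

```python
# Rough sketch: estimate per-request speed from a batched benchmark.
# The numbers here are hypothetical placeholders, not real measurements.

def per_user_speed(batched_tokens_per_sec: float, concurrent_users: int) -> float:
    """Approximate tokens/sec each user sees when requests are batched."""
    return batched_tokens_per_sec / concurrent_users

# e.g. a hypothetical benchmark showing 400 tok/s total at batch size 8
print(per_user_speed(400.0, 8))  # -> 50.0 tok/s per request
```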
There are other backends like MLC, TGI, and vLLM that are better suited to this, but they have much worse quant support.
The "minimum" is one GPU that completely fits the size and quant of the model you are serving.
People serve lots of users through Kobold Horde using only single- and dual-GPU configurations, so this isn't something you'll need tens of thousands for.