this post was submitted on 27 Nov 2023

LocalLLaMA


Community to discuss about Llama, the family of large language models created by Meta AI.

founded 1 year ago

Hi all,

Just curious if anybody knows the power required to make a llama server which can serve multiple users at once.

Any discussion is welcome:)

[–] SupplyChainNext@alien.top 1 points 11 months ago

Figure out the size and speed you need, then buy 20-50 of Nvidia's pro GPUs (the A series), plus the server cluster hardware and network infrastructure needed to make them run efficiently.
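A rough back-of-envelope check of that claim (the per-GPU price and overhead factor below are assumptions for illustration, not vendor quotes):

```python
# Rough cost sketch for a 20-50 GPU inference cluster.
# Prices are ballpark assumptions, not quotes.
gpu_price = 7_000          # assumed price of one pro-class GPU (e.g. A6000), USD
gpu_counts = (20, 50)      # range suggested above
infra_overhead = 1.3       # assumed ~30% extra for servers, networking, racks

low = gpu_counts[0] * gpu_price * infra_overhead
high = gpu_counts[1] * gpu_price * infra_overhead
print(f"${low:,.0f} - ${high:,.0f}")  # -> $182,000 - $455,000
```

Even with generous assumptions, the total lands in the several-hundred-thousand-dollar range described above.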

Think in the several hundred thousand dollar range. I’ve looked into it.

[–] Tiny_Arugula_5648@alien.top 1 points 11 months ago

Unless you're doing this as a business, it's going to be massively cost-prohibitive: hundreds of thousands of dollars of hardware. If it is a business, you'd better start talking to cloud vendors, because GPUs are an incredibly scarce resource right now.

[–] seanpuppy@alien.top 1 points 11 months ago (1 children)

It depends a lot on the details, tbh. Do the users share one model, or does each use a different LoRA? If it's the latter, there's some cool recent research on efficiently hosting many LoRAs on one machine.

[–] Appropriate-Tax-9585@alien.top 1 points 11 months ago

At the moment I'm just trying to grasp the basics, e.g. what kind of GPUs I will need and how many. This is mostly for comparison against SaaS options; in practice I only need to set up a server for testing with a few users. I'm going to research it myself, but I like this community and wanted to hear others' views, since I imagine many here have tried to run their own servers :)

[–] Prudent-Artichoke-19@alien.top 1 points 11 months ago

One or two A6000s can serve a 70B model with decent tokens/sec for 20 people. You can also run a swarm using Petals and add GPUs as needed; LLM sharding can be pretty useful.
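A quick sanity check of why a 70B model fits on one or two A6000s: with 4-bit quantization the weights shrink to roughly half a byte per parameter. The overhead factor below is an assumed ballpark for KV cache and activations, not a measurement:

```python
# Why a 70B model can fit on two A6000s: rough VRAM arithmetic.
# Assumes 4-bit quantized weights (~0.5 bytes/param) plus ~20%
# assumed headroom for KV cache and activations.
params = 70e9
bytes_per_param = 0.5          # 4-bit quantization
overhead = 1.2                 # assumed KV cache / activation headroom
need_gb = params * bytes_per_param * overhead / 1e9
a6000_gb = 48                  # VRAM per RTX A6000
print(f"need ~{need_gb:.0f} GB vs {2 * a6000_gb} GB across two A6000s")
```

Around 42 GB needed against 96 GB available, which leaves room for larger batches or longer contexts.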

[–] pablines@alien.top 1 points 11 months ago

Hugging Face's Text Generation Inference (TGI) can handle concurrency; you just need to back it with enough GPUs.
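A minimal launch sketch for TGI on a 4-GPU host, as a config fragment. The model ID, image tag, and flag values here are illustrative assumptions; check the TGI docs for current options:

```shell
# Sketch: serve a 70B model with TGI sharded across 4 GPUs.
# Flag values are illustrative, not a tuned production config.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-2-70b-chat-hf \
  --num-shard 4 \
  --max-concurrent-requests 128
```

TGI then batches concurrent requests internally, which is what makes multi-user serving work on a single box.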

[–] a_beautiful_rhind@alien.top 1 points 11 months ago

You would have to benchmark batched throughput in something like llama.cpp or exllamav2 and then divide it by the number of concurrent users to see what each gets per request.
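The divide-by-users arithmetic looks like this (the throughput and user count are made-up illustration values; substitute your own benchmark results):

```python
# Per-user throughput from a batched benchmark.
# Numbers are illustrative assumptions, not measurements.
batch_tokens_per_sec = 400    # assumed total throughput at batch size 16
concurrent_users = 16
per_user = batch_tokens_per_sec / concurrent_users
print(f"~{per_user:.0f} tokens/sec per user")  # -> ~25 tokens/sec per user
```

Anything above roughly 10 tokens/sec per user tends to feel responsive for chat use.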

There are other backends like MLC/TGI/vLLM that are better adapted to this, but they have much worse quantization support.

The "minimum" is one GPU that completely fits the size and quant of the model you are serving.

People serve lots of users through Kobold Horde using only single- and dual-GPU configurations, so this isn't something you'll need tens of thousands of dollars for.

[–] Aggressive-Drama-899@alien.top 1 points 11 months ago (1 children)

We run Llama 2 70B for around 20-30 active users using TGI and 4x A100 80GB on Kubernetes. If two users send a request at exactly the same time, there is about a 3-4 second delay for the second user. We've never really had any complaints about speed so far, and we can spin up additional containers if it becomes a problem. This is all on-prem.
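A quick sanity check of why that setup is comfortable: an unquantized 70B model in fp16/bf16 takes about 2 bytes per parameter, leaving plenty of the 4x A100 pool for KV cache and batching:

```python
# Sanity check: unquantized Llama 2 70B on 4x A100 80GB.
# fp16/bf16 weights are 2 bytes/param; the remainder is
# available for KV cache, activations, and batch headroom.
params = 70e9
weights_gb = params * 2 / 1e9     # fp16 weight footprint
total_gb = 4 * 80                 # 4x A100 80GB
print(f"weights ~{weights_gb:.0f} GB of {total_gb} GB total")
```

Roughly 140 GB of weights against 320 GB of VRAM, which is why 20-30 concurrent users fit with only a few seconds of queueing.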

[–] Appropriate-Tax-9585@alien.top 1 points 11 months ago

Thank you, this is really good to hear!

[–] dododragon@alien.top 1 points 11 months ago

Have a look at https://www.runpod.io/ for AI cloud hosting. You can run tests based on the number of users you want to cater for and see what capacity you get for your money.

Start with a basic plan, run some tests to see what it can handle and compare it as you scale up the number of users with simultaneous queries.