LocalLLaMA

A community for discussing Llama, the family of large language models created by Meta AI.

I scaled Mistral 7B to 200 GPUs in less than 5 minutes (alien.top)

submitted 2 years ago by Ok_Post_149@alien.top to c/localllama@poweruser.forum

I've been working on a project with my roommate to make it incredibly simple to run batch inference on LLMs while leveraging a massive amount of cloud resources. We finally got the tool working and created a tutorial on how to use it on Mistral 7B.

Also, if you're a frequent HuggingFace user, you can easily adapt the code to run inference on other LLMs. Please test it out and give feedback. I feel really good about how easy it is to use, but I want to find out if anything is not intuitive. I hope the community gets some value out of it! Here is the link to the tutorial: https://docs.burla.dev/Example:%20Massively%20Parallel%20Inference%20with%20Mistral-7B
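For readers who want a feel for the general pattern before opening the tutorial, here is a minimal sketch of batch inference fanned out over parallel workers. It is not the tool's API: it uses Python's standard `ThreadPoolExecutor` as a local stand-in for remote machines, and `run_inference`, `chunked`, and `parallel_batch_inference` are hypothetical names with a placeholder where a real model call would go.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_inference(prompts):
    # Hypothetical placeholder: a real worker would load a model once
    # and generate a completion for every prompt in this chunk.
    return [f"completion for: {p}" for p in prompts]


def parallel_batch_inference(prompts, chunk_size=4, workers=3):
    """Run every chunk on a worker and return results in input order."""
    chunks = chunked(prompts, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order the chunks were submitted.
        per_chunk = list(pool.map(run_inference, chunks))
    # Flatten the per-chunk outputs back into one list.
    return [out for chunk_out in per_chunk for out in chunk_out]


if __name__ == "__main__":
    outputs = parallel_batch_inference([f"prompt {i}" for i in range(10)])
    print(len(outputs))
```

Sending chunks rather than single prompts to each worker means a worker can amortize model loading across many inputs, which matters most for long batch jobs.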

[–] Puzzleheaded-Pay-476@alien.top 1 points 2 years ago

This is actually really cool. It was simple and I didn’t run into any issues! A couple of major questions though, since you’re managing the infrastructure: how long are you going to let people use your compute? Do I get a certain amount for free? How much will it cost once I need to start paying?

Unique concept, I like it

[–] caikenboeing727@alien.top 1 points 2 years ago

Very impressive! Lots of good use cases for this.

[–] DarthNebo@alien.top 1 points 2 years ago

You should look into continuous batching. Most of your parallel requests run at batch size 1, which heavily underutilises VRAM and leaves a lot of easily achievable throughput on the table.
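To illustrate the point about batch size, here is a toy simulation, not a real inference server. It assumes each request needs a known number of decoding steps (at least 1) and that one step advances every request in the batch at once. The function names are made up for this sketch. It counts the steps needed under static batching versus continuous batching, where a finished request's slot is refilled immediately.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each fixed batch runs until its longest request has finished."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Free slots are refilled from the queue before every decoding step."""
    queue = deque(lengths)
    active = []  # remaining steps for each in-flight request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Advance every active request by one step and drop finished ones.
        active = [r - 1 for r in active if r > 1]
    return steps


if __name__ == "__main__":
    lengths = [8, 2, 2, 2, 8, 2, 2, 2]  # hypothetical output lengths
    print(static_batching_steps(lengths, 1))      # batch size 1
    print(static_batching_steps(lengths, 4))      # static batches of 4
    print(continuous_batching_steps(lengths, 4))  # continuous, 4 slots
```

With these made-up lengths, batch size 1 takes 28 steps, static batches of 4 take 16, and continuous batching with 4 slots takes 10, because short requests no longer wait for the longest one in their batch.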

[–] sergeant113@alien.top 1 points 2 years ago (1 children)

I’m in the middle of building my app on Modal. Guess I’ll adapt it to run on your service and see. Thanks for sharing!

[–] Ok_Post_149@alien.top 1 points 2 years ago

This is really cool! We are more focused on lengthy batch workloads, such as running 500k inputs through an LLM in one job, than on on-demand inference, though we are starting to support that too. Right now the startup time is pretty long (2-5 minutes), but we are working on cutting it down.