this post was submitted on 30 Oct 2023
1 points (100.0% liked)

LocalLLaMA

1 readers
1 users here now

Community to discuss about Llama, the family of large language models created by Meta AI.

founded 10 months ago
MODERATORS
 

I've been working on a project with my roommate to make it incredibly simple to run batch inference on LLMs while leveraging a massive amount of cloud resources. We finally got the tool working and created a tutorial on how to use it on Mistral 7B.

Also, if you're a frequent HuggingFace user you can easily adapt the code to run inference on other LLM models. Please test it out and provide feedback, I feel really good about how easy it is to use but I want to figure out if anything is not intuitive. I hope the community is able to get some value out of it! Here is the link to the tutorial https://docs.burla.dev/Example:%20Massively%20Parallel%20Inference%20with%20Mistral-7B

you are viewing a single comment's thread
view the rest of the comments
[–] DarthNebo@alien.top 1 points 10 months ago

You should look into continuous batching as most of your parallel requests are batch size 1 & heavily under utilising the VRAM & overall throughput that would have been easily possible.