Ok_Post_149

joined 10 months ago
 

I'm trying to perfect a dev tool for Python developers to easily scale their code across thousands of cloud resources with a single line of code.
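To give a rough idea of the shape I'm going for, the pattern is a parallel map over a list of inputs. Here's a minimal local sketch using only the standard library as an analogy (this is not the tool's actual API; the function and variable names are just illustrative):

```python
# Local analogue of the pattern: map an ordinary Python function over many inputs in parallel.
# The tool's pitch is that this same shape fans out across thousands of cloud machines.
from concurrent.futures import ProcessPoolExecutor

def score(x: int) -> int:
    return x * x  # stand-in for any expensive per-item computation

if __name__ == "__main__":
    inputs = list(range(10_000))
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(score, inputs))  # the "one line" of parallelism
    print(results[:5])
```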

I want to collect some project ideas so I can build useful tutorials for running inference on and fine-tuning open-source LLMs.

A few weeks back I created a tutorial teaching people how to massively parallelize inference with Mistral-7B. It delivered a lot of value to a small group of people and helped me better understand the flaws in my tool.

Anyway, I want to open it up to the community before I decide which tutorials to prioritize. Please drop any project/tutorial ideas, and if you think someone's idea is good, please upvote it (so I know you'd find it valuable).

[–] Ok_Post_149@alien.top 1 points 10 months ago

This is really useful feedback. I'd definitely be able to produce a revenue-generating product faster if I focused on chatbots, so in terms of getting funding for this idea that seems like the better avenue. In the future I could address both use cases, but I'm trying not to spread myself too thin at the moment.

[–] Ok_Post_149@alien.top 1 points 10 months ago

Thanks for the feedback. What is your definition of an on-prem chatbot? Hosted on their own physical infrastructure?

 

Hey All,

What does making a model prediction look like in your current projects? Are you building a model for a web app and running on-demand inference? Or are you working on a research project or analysis that requires making hundreds of thousands to millions of predictions all at once?
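To make the distinction concrete, here's a rough sketch of the two shapes I mean. The model and the batching details are just placeholders for illustration, not a recommendation:

```python
# Two inference workflow shapes, illustrated with a small Hugging Face model (placeholder choice).
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
# GPT-2 tokenizers ship without a pad token, so set one to allow batched calls.
generator.tokenizer.pad_token_id = generator.tokenizer.eos_token_id

# On-demand inference: one prediction per incoming request (typical web app).
def handle_request(prompt: str) -> str:
    return generator(prompt, max_new_tokens=32)[0]["generated_text"]

# Batch inference: run a large list of inputs all at once (research / offline analysis).
prompts = [f"Summarize record {i}:" for i in range(1_000)]
outputs = generator(prompts, max_new_tokens=32, batch_size=16)
batch_results = [out[0]["generated_text"] for out in outputs]
```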

I'm currently at a crossroads with a developer tool I'm building and trying to figure out which types of inference workflows to focus on. A few weeks back I posted a tutorial on running Mistral-7B on hundreds of GPUs in the cloud in parallel. A decent number of people said batch inference is relevant to them, but over the last couple of days I've been running into more and more developers building web apps that don't need to make many predictions at once. If you were me, where would you direct your focus?

Anyway, I'm kind of rambling, but I would love to know what you're all working on and to get some advice on the direction I should pursue.

[–] Ok_Post_149@alien.top 1 points 10 months ago

Is home hardware a requirement for this project? I guess I'm a little confused about what that has to do with model hallucinations.

[–] Ok_Post_149@alien.top 1 points 10 months ago

I just wrote a tutorial on how you can scale Mistral-7B to many GPUs in the cloud; I hope it's useful. I'm not sure whether you're looking to do on-demand inference or batch inference over a large set of inputs.

https://www.reddit.com/r/LocalLLaMA/comments/17k2x62/i_scaled_mistral_7b_to_200_gpus_in_less_than_5/

[–] Ok_Post_149@alien.top 1 points 10 months ago

This is really cool! We're more focused on long-running workloads, e.g. pushing 500k inputs through an LLM in one batch rather than on-demand inference (though we're starting to support that too). Right now the startup time is pretty long (2-5 minutes), but we're working on cutting it down.

 

I've been working on a project with my roommate to make it incredibly simple to run batch inference on LLMs while leveraging a massive amount of cloud resources. We finally got the tool working and created a tutorial on how to use it with Mistral-7B.

Also, if you're a frequent Hugging Face user you can easily adapt the code to run inference on other LLMs; a rough sketch of what that adaptation looks like is below. Please test it out and give feedback: I feel really good about how easy it is to use, but I want to find out if anything isn't intuitive. I hope the community gets some value out of it! Here is the link to the tutorial: https://docs.burla.dev/Example:%20Massively%20Parallel%20Inference%20with%20Mistral-7B
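For example, the per-model part of the code is basically a model ID plus a small inference function, so swapping models mostly means changing the ID. Something like this (a hedged sketch, not copied verbatim from the tutorial; requires transformers and accelerate):

```python
# Rough sketch of the per-worker inference function you'd adapt for other models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.1"  # swap in any causal LM from the Hub

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def run_inference(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```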