I think batched inference is a must for companies that want to put an on-premise chatbot in front of their users, and it's a use case a lot of teams are working on at the moment. I saw that llama.cpp added support for batched inference only about two weeks ago; I don't have hands-on experience with it yet.
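For anyone who wants to poke at it, here is a minimal sketch of what serving several users at once could look like. It assumes the llama.cpp `server` example started with parallel slots (`--parallel`, `--cont-batching`) and its `/completion` endpoint returning a `content` field; check the current README before relying on any of those names.

```python
# Sketch: fan several requests at a llama.cpp server started with parallel
# slots, e.g. (flags assumed from the server example's docs):
#   ./server -m model.gguf -c 4096 --parallel 4 --cont-batching
# With batching enabled the server can decode the requests together instead
# of strictly one after another.
import concurrent.futures
import json
import urllib.request

SERVER_URL = "http://127.0.0.1:8080/completion"  # assumed default host/port

def complete(prompt: str, n_predict: int = 64) -> str:
    """Send one prompt to the server and return the generated text."""
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()
    req = urllib.request.Request(
        SERVER_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]

prompts = [
    "Summarise our refund policy in one sentence.",
    "Translate 'good morning' to German.",
    "List three uses of a paperclip.",
    "What is 17 * 23?",
]

# Issue the requests concurrently, one per server slot.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for prompt, answer in zip(prompts, pool.map(complete, prompts)):
        print(f"{prompt!r} -> {answer!r}")
```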
Thanks for this feedback. What is your definition of an on-prem chatbot? Hosted on their own physical infrastructure?
Does llama.cpp support batch inference on CPU?
I think that in the longer term, demand for the "do 10,000 generations at once" pattern will rise. Chatbots and chat-based interfaces with fairly spread-out, consistent traffic are the first widely propagating use case for LLMs, but they are a bit gimmicky. There are, and will be, plenty of very specific, niche domain use cases where you want hundreds or thousands of generations at once and then see no traffic again for days or weeks until the next sudden spike.
If your current demand is from chatbots then build for that, but once other industries and domains figure out how best to use LLMs, I reckon there will be growing demand for cloud compute that can handle infrequent but very spiky inference requests.
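As a rough illustration of that spiky batch shape, the sketch below pushes a large backlog of prompts through a local generation endpoint with bounded concurrency and then goes quiet. The endpoint and payload format are the same assumptions as in the example above; swap in whatever your serving stack actually exposes.

```python
# Sketch of the "spiky batch job" workload: thousands of generations at once,
# then nothing until the next spike. Endpoint and JSON fields are assumptions.
import asyncio
import json
import urllib.request

SERVER_URL = "http://127.0.0.1:8080/completion"  # assumed endpoint
MAX_IN_FLIGHT = 16                               # cap on concurrent requests

def generate_blocking(prompt: str) -> str:
    """One blocking generation call; runs in a worker thread below."""
    payload = json.dumps({"prompt": prompt, "n_predict": 128}).encode()
    req = urllib.request.Request(
        SERVER_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]

async def run_batch(prompts: list[str]) -> list[str]:
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def one(prompt: str) -> str:
        async with sem:  # never more than MAX_IN_FLIGHT requests outstanding
            return await asyncio.to_thread(generate_blocking, prompt)

    return await asyncio.gather(*(one(p) for p in prompts))

if __name__ == "__main__":
    # Stand-in for the real backlog, e.g. thousands of documents to classify.
    backlog = [f"Classify the sentiment of review #{i}." for i in range(10_000)]
    results = asyncio.run(run_batch(backlog))
    print(f"Completed {len(results)} generations.")
```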
This is really useful feedback. I'd definitely be able to ship a revenue-generating product faster if I focus on chatbots, so in terms of trying to get funding for this idea that seems to be the better avenue. In the future I could definitely address both use cases, but I'm trying not to spread myself too thin at the moment.