LocalLLaMA

Community for discussing Llama, the family of large language models created by Meta AI.

Need help setting up a cost-efficient llama v2 inference API for my micro saas app (alien.top)

submitted 11 months ago by m1ss1l3@alien.top to c/localllama@poweruser.forum

I run a micro SaaS app that would benefit a lot from using Llama v2 to add question-answering capabilities for our customers' end users. We've already done some investigation with the 7B Llama v2 base model, and its responses are good enough to support our use case. However, since it's a micro business right now and we are not VC funded, we need to figure out the costs.

We process about 4 million messages per month, of which we'd need to run 1M through the model and generate a response. Latency under 30 seconds is required. That works out to about 23 messages/minute. The number of tokens used would be ~4096 per invocation.
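As a rough check on those numbers, here is a minimal sketch of the load arithmetic. It assumes a 30-day month and perfectly even traffic (real traffic will have peaks), and it treats the ~4096 tokens as prompt plus generated tokens combined; the function name is made up for illustration.

```python
def required_load(messages_per_month, tokens_per_message, days_per_month=30):
    """Return (messages per minute, tokens per second) for evenly spread traffic."""
    minutes = days_per_month * 24 * 60
    messages_per_minute = messages_per_month / minutes
    tokens_per_second = messages_per_month * tokens_per_message / (minutes * 60)
    return messages_per_minute, tokens_per_second


# Figures from the post: 1M model invocations per month, ~4096 tokens each.
msgs_per_min, toks_per_sec = required_load(1_000_000, 4096)
print(f"{msgs_per_min:.1f} messages/minute, {toks_per_sec:.0f} tokens/second")
```

Under these assumptions the average load is about 23 messages/minute and roughly 1,580 tokens/second in total, so whatever setup is chosen would need to sustain that on average, with headroom for bursts.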

Commercial models like PaLM 2 or GPT X would be too expensive for us, so I'm wondering if there is a setup that can do this cost-efficiently. We have a bunch of GCP AI credits to fine-tune and experiment with, but they run out in less than a year, so we need to think about long-term sustainability. We can probably spare 500-1000 a month for the inference API, in the hope that our customers will pay more $$ for this service.
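To compare that budget against per-token pricing, a small sketch can convert it into an allowed cost per million tokens. It assumes the 500-1000 figure is in US dollars and reuses the 1M messages/month and ~4096 tokens/message figures from the post; the function name is hypothetical.

```python
def budget_per_million_tokens(monthly_budget, messages_per_month, tokens_per_message):
    """Return how much can be spent per 1M tokens without exceeding the monthly budget."""
    total_tokens = messages_per_month * tokens_per_message
    return monthly_budget / (total_tokens / 1_000_000)


# Assumed USD budget range from the post: 500 to 1000 per month.
for budget in (500, 1000):
    allowed = budget_per_million_tokens(budget, 1_000_000, 4096)
    print(f"budget {budget}/month -> about {allowed:.3f} per 1M tokens")
```

Any hosted API or self-hosted GPU setup would need an effective all-in cost at or below that range per million tokens to fit the stated budget.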

Any guidance or benchmarks using various optimized models you can share would be very helpful.

[–] noobgolang@alien.top 1 points 11 months ago (7 children)

You can try https://nitro.jan.ai/. It's built for this purpose.

[–] MannowLawn@alien.top 1 points 11 months ago (4 children)

Chrome is marking the download from the GitHub repo as suspicious.

[–] noobgolang@alien.top 1 points 11 months ago (3 children)

Also, the build is done 100% in public, with the source code on the page. You can check the Actions button to see it; there is nothing hidden here.

[–] MannowLawn@alien.top 1 points 11 months ago (1 children)

Thanks, I'll have a look. It seems very promising for my use case as well. By the way, is Nitro different from the download you have on the main page? Nitro seems to be only for Apple's M1 models, while the main page mentions M2 models as well?

[–] noobgolang@alien.top 1 points 11 months ago (1 children)

"m1 models of apple and on main page it mentions m2 models as well?"

Yeah, the arm64 Mac build should run on all Macs with M1 and M2 chips. We also have a CUDA version in the release.

[–] MannowLawn@alien.top 1 points 11 months ago

Cheers! I'll keep a close watch on this. Nice work!
