m1ss1l3

joined 11 months ago
[–] m1ss1l3@alien.top 1 points 11 months ago (1 children)

I tried this but got a bunch of errors with the binary. Can you share the versions of CUDA and the other dependencies needed for this?

[–] m1ss1l3@alien.top 1 points 11 months ago

This is pretty cool, thanks for sharing. I'll try it out and check performance.

[–] m1ss1l3@alien.top 1 points 11 months ago

Thanks for all your work!!
The instance you used looks like it was $0.526 per hour, which would fit our budget!!

Also, I want to make sure I'm reading the benchmark results right: is it correct that it took about 26s to serve all 4 requests in parallel with the quantized model, under the 2048+512 token assumption?
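
For my own sanity, here's the back-of-the-envelope math behind that question (the 26s figure and the 2048+512 split are my reading of your post, so correct me if I'm off):

```python
# Rough throughput implied by the benchmark as I read it
# (all numbers are my assumptions from the thread, not measured values).
requests = 4                  # served in parallel
prompt_tokens = 2048          # per request
output_tokens = 512           # per request
wall_clock_s = 26.0           # time to finish all 4 requests

generated = requests * output_tokens                     # 2,048 tokens generated
processed = requests * (prompt_tokens + output_tokens)   # 10,240 tokens in + out

print(f"~{generated / wall_clock_s:.0f} generated tokens/s")  # ~79
print(f"~{processed / wall_clock_s:.0f} total tokens/s")      # ~394
```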

 

I run a micro SaaS app that would benefit a lot from using Llama v2 to add some question & answering capabilities for our customers' end users. We've already done some investigation with the 7B Llama v2 base model, and its responses are good enough to support our use case. However, given that it's a micro business right now and we are not VC funded, we need to figure out the costs.

We process about 4 million messages per month, of which we'd need to run about 1M through the model and generate a response for each. Latency under 30 seconds would be required, which works out to roughly ~23 messages/minute. The number of tokens used would be ~4096 for each invocation.
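
For reference, the ~23 messages/minute figure is just the 1M messages spread evenly over a 30-day month (a simplifying assumption; real traffic will be burstier):

```python
# Sustained throughput needed if the 1M monthly messages arrive evenly
# (a simplifying assumption; real traffic will be burstier).
messages_per_month = 1_000_000
minutes_per_month = 30 * 24 * 60             # 43,200

msgs_per_minute = messages_per_month / minutes_per_month
print(f"~{msgs_per_minute:.1f} messages/minute")       # ~23.1
print(f"~{msgs_per_minute / 60:.2f} messages/second")  # ~0.39
```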

Commercial models like PaLM 2 or GPT X would be too expensive for us, so I'm wondering if there is a path to a setup that can do this cost-efficiently. We have a bunch of GCP AI credits to fine-tune and experiment with, but they run out in less than a year, so we need to think about long-term sustainability. We can probably spare $500-1000 a month for the inference API, with the hope that our customers will pay more $$ for this service.
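
To make the budget concrete, this is the rough cost math I'm working with; the hourly rate below is just an illustrative on-demand GPU price (the $0.526/hr instance mentioned elsewhere in this thread), not a quote:

```python
# Rough monthly cost of one always-on GPU instance (illustrative only;
# the hourly rate is an assumption, plug in your actual instance price).
hourly_rate_usd = 0.526      # example on-demand rate
hours_per_month = 24 * 30    # ~720 hours in a 30-day month

monthly_cost = hourly_rate_usd * hours_per_month
print(f"~${monthly_cost:.0f}/month per instance")      # ~$379

# A $500-1000/month budget covers roughly this many such instances:
for budget in (500, 1000):
    print(f"${budget}: ~{budget / monthly_cost:.1f} instances")  # ~1.3 / ~2.6
```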

Any guidance or benchmarks with various optimized models that you can share would be very helpful.