this post was submitted on 24 Nov 2023

LocalLLaMA


Community to discuss about Llama, the family of large language models created by Meta AI.


Hey guys,

I'm running the quantized version of Mistral-7B-Instruct and it's pretty fast and accurate for my use case. On my PC I'm generating approximately 4 tokens per second, and since the idea is to generate one-sentence responses for my NPC characters, that's good enough for what I need.

After fiddling around with oobabooga a bit I found out that you can perform API calls on localhost and print out the text, which is exactly what I need for this to work.
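For reference, here is a minimal sketch of the kind of localhost call I mean. It assumes oobabooga's text-generation-webui was started with the API enabled (`--api`) and exposes the OpenAI-compatible completions endpoint on the default port 5000; the exact path, port, and fields may differ between versions, so check your own instance.

```python
import requests

# Minimal sketch: assumes text-generation-webui is running with --api and
# serving the OpenAI-compatible endpoint on the default port 5000.
# Endpoint path and port may differ depending on version and settings.
URL = "http://127.0.0.1:5000/v1/completions"

payload = {
    "prompt": "The blacksmith greets the player in one short sentence:",
    "max_tokens": 40,
    "temperature": 0.7,
}

response = requests.post(URL, json=payload, timeout=60)
print(response.json()["choices"][0]["text"])
```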

The issue I'm running into is that if I were to make a game with AI-generated content, how can I make it easy for players to run their own local server and have the game perform API calls against it? I feel like setting all this up would be a nightmare for the inexperienced, and I don't want to alienate non-tech-savvy players.

top 3 comments
[–] henk717@alien.top 1 points 2 years ago

I'd go the Koboldcpp route instead because it's portable, so it's much simpler for them to install and use. Koboldcpp has API documentation available if you add /api to a working link (or you can just check it here). If you already built against the OpenAI-compatible API stuff, it supports that too.
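As an illustration, a rough sketch of calling Koboldcpp's KoboldAI-style generate endpoint, assuming the default port 5001; the /api docs on a running instance list the authoritative parameters.

```python
import requests

# Rough sketch: Koboldcpp's KoboldAI-compatible generate endpoint,
# assuming the default port 5001. Check /api on a running instance
# for the full parameter list.
URL = "http://localhost:5001/api/v1/generate"

payload = {
    "prompt": "The innkeeper answers the player in one sentence:",
    "max_length": 40,
    "temperature": 0.7,
}

response = requests.post(URL, json=payload, timeout=60)
print(response.json()["results"][0]["text"])
```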

[–] DarthNebo@alien.top 1 points 2 years ago

The fastest way would be to integrate ggerganov's server.cpp (the llama.cpp server example) and make HTTP calls to it. Way easier to package into other apps, and it supports parallel decoding with ~30 tok/s on Apple Silicon (M1 Pro).
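A small sketch of what hitting that bundled server could look like, assuming it was launched with something like `./server -m mistral-7b-instruct.Q4_K_M.gguf` and listens on the default port 8080; the /completion endpoint and field names may change between releases.

```python
import requests

# Small sketch: llama.cpp's bundled HTTP server (built from the server
# example), assumed to be running on the default port 8080.
# Endpoint and field names may change between releases.
URL = "http://localhost:8080/completion"

payload = {
    "prompt": "The guard replies to the player in one sentence:",
    "n_predict": 40,
    "temperature": 0.7,
}

response = requests.post(URL, json=payload, timeout=60)
print(response.json()["content"])
```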

[–] gthing@alien.top 1 points 2 years ago

Look at the API examples in the ooba code.