I'd go the KoboldCpp route instead because it's portable, so it's much simpler for them to install and use. KoboldCpp has API documentation available if you add /api to a working link (or you can just check it here). If you already built against the OpenAI-compatible API stuff, it supports that too.
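Something like this works as a starting point (a rough sketch, assuming KoboldCpp's default port 5001 and its KoboldAI-style generate endpoint; double-check the field names against your instance's /api docs):

```python
# Rough sketch of calling a local KoboldCpp instance over HTTP.
# Assumes the default port (5001); adjust host/port and sampler
# settings for your setup.
import requests

resp = requests.post(
    "http://localhost:5001/api/v1/generate",
    json={
        "prompt": "Write a haiku about local LLMs:",
        "max_length": 80,    # max tokens to generate
        "temperature": 0.7,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["results"][0]["text"])
```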
The fastest way would be to integrate ggerganov's server.cpp module (the llama.cpp server example) and make HTTP calls to it. It's way easier to package into other apps and supports parallel decoding, at ~30 tok/s on Apple Silicon (M1 Pro).
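With the server running, a call could look like this (a sketch, assuming the stock /completion endpoint on the default port 8080):

```python
# Sketch of hitting the llama.cpp server's /completion endpoint.
# Assumes something like `./server -m model.gguf` is already running
# on the default port 8080.
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "Explain parallel decoding in one sentence:",
        "n_predict": 64,  # max tokens to generate
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["content"])
```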
Look at the API examples in the ooba (text-generation-webui) code.
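If you go that route, its OpenAI-compatible endpoint is probably the least work; here's a sketch (port 5000 and the /v1/completions path are assumptions from the default config, so verify against the examples in the repo):

```python
# Sketch of calling ooba's OpenAI-compatible completions endpoint.
# Port 5000 and the /v1/completions path are assumed defaults.
import requests

resp = requests.post(
    "http://localhost:5000/v1/completions",
    json={
        "prompt": "Hello, world:",
        "max_tokens": 32,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```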