this post was submitted on 17 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


Hey guys, as the title suggests, I'd like some advice on the best way to serve LLMs with support for GBNF or similar so that I receive deterministic output. I have been using text-generation-webui locally, where I can add my grammar; however, I would like to be able to do this across a cluster that can run inference with high throughput. Any suggestions on how best to accomplish this?
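
For reference, this is the kind of grammar I mean: a minimal GBNF sketch that pins the output to a tiny JSON object (the rule names and schema are just illustrative).

```
# Constrains output to: {"answer": "<plain text>"}
root   ::= "{" ws "\"answer\"" ws ":" ws string ws "}"
string ::= "\"" [a-zA-Z0-9 .,?!-]* "\""
ws     ::= [ \t\n]*
```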


A naive solution would be to run multiple instances of text-generation-webui in a cluster and load-balance requests across them, roughly as sketched below. My gut says there's a better approach I could be using.
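
Something like this is what I have in mind. This is an untested sketch: the node URLs are placeholders, and I'm assuming text-generation-webui's OpenAI-compatible API with its `grammar_string` extension parameter.

```python
import itertools
import requests

# Placeholder backend nodes, each running a text-generation-webui API server
BACKENDS = itertools.cycle([
    "http://node-1:5000",
    "http://node-2:5000",
])

def generate(prompt: str, grammar: str) -> str:
    """Send a grammar-constrained completion to the next node, round-robin."""
    base = next(BACKENDS)
    resp = requests.post(
        f"{base}/v1/completions",       # OpenAI-compatible endpoint
        json={
            "prompt": prompt,
            "max_tokens": 128,
            "grammar_string": grammar,  # assumption: webui's extension param name
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]
```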

1 comment
mcmoose1900@alien.top 11 months ago

Llama.cpp's example server supports batching and custom grammars.
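
E.g. a request against its `/completion` endpoint, roughly like this (untested sketch; the host/port and grammar are placeholders):

```python
import requests

GRAMMAR = r'root ::= ("yes" | "no")'  # trivial placeholder GBNF

resp = requests.post(
    "http://localhost:8080/completion",  # llama.cpp example server
    json={
        "prompt": "Is the sky blue? Answer:",
        "n_predict": 8,
        "grammar": GRAMMAR,              # server-side GBNF constraint
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["content"])
```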

It's a work in progress for Aphrodite: https://github.com/PygmalionAI/aphrodite-engine/issues/36#issuecomment-1747429134