[–] MonkeyMaster64@alien.top 1 points 11 months ago

Is this able to use CPU (similar to llama.cpp)?

 

Hey guys, as the title suggests, I'd like some advice on the best way to serve LLMs with support for GBNF grammars (or similar) so I can ensure I receive deterministic, well-formed output. I've been using text-generation-web-ui locally, where I can add my grammar, but I'd like to be able to do the same across a cluster that can run inference at high throughput. Any suggestions on how best to accomplish this?
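
In case it helps frame answers: a GBNF grammar constrains sampling so the model can only emit strings the grammar accepts. Outside the web UI, the same thing can be done directly against llama.cpp, e.g. via llama-cpp-python. A minimal sketch of what I mean (the model path and the toy yes/no grammar are placeholders I made up for illustration):

```python
# Minimal sketch: grammar-constrained generation with llama-cpp-python.
from llama_cpp import Llama, LlamaGrammar

# Toy GBNF grammar that restricts output to "yes" or "no".
GRAMMAR = r'''
root ::= "yes" | "no"
'''

llm = Llama(model_path="./model.gguf")  # placeholder model path
grammar = LlamaGrammar.from_string(GRAMMAR)

result = llm(
    "Is the sky blue? Answer yes or no:",
    grammar=grammar,  # sampling is limited to strings the grammar accepts
    max_tokens=8,
)
print(result["choices"][0]["text"])
```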


A naive solution would be to run multiple instances of text-generation-web-ui in a cluster and distribute requests across them. My gut says there's a better approach I could use.
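
For what it's worth, the naive approach is only a few lines of client-side glue if each instance exposes an HTTP completion API. A rough sketch, assuming each node serves an OpenAI-style `/v1/completions` endpoint and accepts a `grammar_string` extension parameter (the URLs, port, and parameter name are all assumptions; check your server's API docs):

```python
# Naive round-robin dispatch over several independent inference servers.
# The backend URLs and the grammar_string parameter are placeholders,
# not a documented API; adjust to whatever your instances actually expose.
import itertools
import requests

BACKENDS = itertools.cycle([
    "http://node1:5000/v1/completions",  # hypothetical instance URLs
    "http://node2:5000/v1/completions",
    "http://node3:5000/v1/completions",
])

def generate(prompt: str, grammar: str) -> str:
    url = next(BACKENDS)  # pick the next instance in rotation
    resp = requests.post(
        url,
        json={
            "prompt": prompt,
            "max_tokens": 64,
            "grammar_string": grammar,  # assumed grammar param; verify against your server
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]
```

In practice you'd probably want the instances behind a real load balancer (nginx, HAProxy, or your cluster's ingress) with health checks rather than rotating URLs in the client, which is part of why this feels naive to me.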