[–] MonkeyMaster64@alien.top 1 points 11 months ago

Is this able to use CPU (similar to llama.cpp)?

 

Hey guys, as the title suggests, I'd like some advice on the best way to serve LLMs with support for GBNF grammars (or similar) so I can ensure I receive deterministic, well-formed output. I've been using text-generation-web-ui locally, where I can add my grammar, but I'd like to be able to do the same across a cluster that can run inference at high throughput. Any suggestions on how best to accomplish this?
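
In case it helps frame answers: a GBNF grammar constrains sampling so the model can only emit strings the grammar accepts. Outside the web UI, the same thing can be done directly against llama.cpp, e.g. via llama-cpp-python. A minimal sketch of what I mean (the model path and the toy yes/no grammar are placeholders I made up for illustration):

```python
# Minimal sketch: grammar-constrained generation with llama-cpp-python.
from llama_cpp import Llama, LlamaGrammar

# Toy GBNF grammar that restricts output to "yes" or "no".
GRAMMAR = r'''
root ::= "yes" | "no"
'''

llm = Llama(model_path="./model.gguf")  # placeholder model path
grammar = LlamaGrammar.from_string(GRAMMAR)

result = llm(
    "Is the sky blue? Answer yes or no:",
    grammar=grammar,  # sampling is limited to strings the grammar accepts
    max_tokens=8,
)
print(result["choices"][0]["text"])
```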


A naive solution would be to run multiple instances of text-generation-web-ui in a cluster and distribute requests across them. My gut says there's a better approach I could use.
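
For what it's worth, the naive approach is only a few lines of client-side glue if each instance exposes an HTTP completion API. A rough sketch, assuming each node serves an OpenAI-style `/v1/completions` endpoint and accepts a `grammar_string` extension parameter (the URLs, port, and parameter name are all assumptions; check your server's API docs):

```python
# Naive round-robin dispatch over several independent inference servers.
# The backend URLs and the grammar_string parameter are placeholders,
# not a documented API; adjust to whatever your instances actually expose.
import itertools
import requests

BACKENDS = itertools.cycle([
    "http://node1:5000/v1/completions",  # hypothetical instance URLs
    "http://node2:5000/v1/completions",
    "http://node3:5000/v1/completions",
])

def generate(prompt: str, grammar: str) -> str:
    url = next(BACKENDS)  # pick the next instance in rotation
    resp = requests.post(
        url,
        json={
            "prompt": prompt,
            "max_tokens": 64,
            "grammar_string": grammar,  # assumed grammar param; verify against your server
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]
```

In practice you'd probably want the instances behind a real load balancer (nginx, HAProxy, or your cluster's ingress) with health checks rather than rotating URLs in the client, which is part of why this feels naive to me.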