ZenEngineer

joined 2 years ago

NVidia H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT-LLM in c/localllama@poweruser.forum

ZenEngineer@alien.top 1 points 2 years ago

There was a paper where you'd run a faster model to come up with a sentence, and then basically run a batch on the big model with each prompt being the same sentence at different lengths, each ending in a different word predicted by the small model, to basically see where the small one went wrong. That gets you a speedup if the two models are more or less aligned.
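
For what it's worth, I think this is the idea usually called speculative decoding. Here's a rough sketch of the greedy version as I understand it, not the paper's actual code. Everything in it is made up for illustration: `toy_target` and `toy_draft` are stand-in functions rather than real models, and `speculative_step`, `generate` and the draft length `k` are hypothetical names. A real setup would score the whole batch in one forward pass of the big model, which is where the speedup comes from.

```python
from typing import Callable, List

# A "model" here is just a function from a token prefix to the next token (greedy).
Model = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: Model, target: Model, k: int = 4) -> List[int]:
    """Return the tokens accepted in one draft-then-verify round."""
    # 1. The small model drafts k tokens, one after another.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(prefix + proposal))

    # 2. The big model is asked for the next token after each partial draft.
    #    This loop stands in for a single batched forward pass.
    batch = [prefix + proposal[:i] for i in range(k + 1)]
    verified = [target(p) for p in batch]

    # 3. Keep drafted tokens up to the first place the big model disagrees.
    accepted: List[int] = []
    for i in range(k):
        if proposal[i] != verified[i]:
            break
        accepted.append(proposal[i])

    # 4. Add the big model's own token: the correction at the mismatch,
    #    or one extra token if the whole draft was accepted.
    accepted.append(verified[len(accepted)])
    return accepted


def generate(prefix: List[int], draft: Model, target: Model, n_tokens: int, k: int = 4) -> List[int]:
    """Generate n_tokens after prefix using repeated speculative rounds."""
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prefix) + n_tokens]


# Toy stand-ins (assumptions): the draft agrees with the target except
# when the prefix length is a multiple of 4.
def toy_target(p: List[int]) -> int:
    return (sum(p) * 3 + len(p)) % 11


def toy_draft(p: List[int]) -> int:
    return toy_target(p) if len(p) % 4 else (toy_target(p) + 1) % 11
```

With greedy decoding the result should be the same as running the big model alone, just with fewer rounds when the draft is mostly right. Each round yields between 1 and `k + 1` tokens, so a badly aligned small model gets you no benefit and costs you the extra drafting work.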

Other than that I could imagine other things, like having batches with one sentence being generated for each actor, one for descriptions, one for actions, etc. Or simply multiple options for you to choose.
