There was a paper where you'd use a faster model to draft a sentence, then run a batch through the big model where every entry is that same sentence at a different length, each ending at a different token predicted by the small model, basically to see where the small one went wrong. That gets you a speedup if the two models are more or less aligned.
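If it helps to see the idea concretely, here's a rough sketch using Hugging Face transformers (gpt2 / gpt2-large are just stand-ins, and the accept/reject rule here is plain greedy agreement rather than the paper's exact sampling scheme). Because of causal attention, one forward pass of the big model over the drafted sequence scores every prefix at once, which is equivalent to that batch of increasing-length prompts:

```python
# Sketch of draft-then-verify, assuming Hugging Face transformers.
# Model names and the greedy acceptance rule are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

draft = AutoModelForCausalLM.from_pretrained("gpt2")         # small/fast model
target = AutoModelForCausalLM.from_pretrained("gpt2-large")  # big model
tok = AutoTokenizer.from_pretrained("gpt2")                  # shared tokenizer

prompt_ids = tok("The quick brown fox", return_tensors="pt").input_ids  # (1, P)
P = prompt_ids.shape[1]

# 1) Let the small model draft K tokens greedily.
K = 5
with torch.no_grad():
    draft_ids = draft.generate(prompt_ids, max_new_tokens=K, do_sample=False)
draft_tokens = draft_ids[0, P:]  # the K drafted tokens

# 2) Run the big model ONCE over the whole drafted sequence.
#    This gives its prediction at every position, i.e. it scores
#    "prompt", "prompt + t1", "prompt + t1 t2", ... in one go.
with torch.no_grad():
    logits = target(draft_ids).logits  # (1, P+K, vocab)

# 3) Keep drafted tokens while the big model agrees; on the first
#    disagreement, take the big model's token and stop.
accepted = []
for i, tok_id in enumerate(draft_tokens):
    pos = P + i - 1  # logits at pos predict the token at pos + 1
    big_choice = logits[0, pos].argmax().item()
    if big_choice == tok_id.item():
        accepted.append(tok_id.item())
    else:
        accepted.append(big_choice)
        break

print(tok.decode(accepted))
```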
Other than that I could imagine other uses, like batches where one sequence is generated per actor, one for descriptions, one for actions, etc. Or simply generating multiple options for you to choose from.
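A toy version of that batched-streams idea, if anyone wants to play with it (the prompts are just placeholders):

```python
# Batch several different prompts (one per actor, one for narration, ...)
# and decode them together in a single generate() call.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token   # gpt2 has no pad token by default
tok.padding_side = "left"       # left-pad for decoder-only generation
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = [
    "Alice says:",         # one stream per actor
    "Bob says:",
    "Scene description:",  # one stream for narration
]
batch = tok(prompts, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model.generate(
        **batch,
        max_new_tokens=20,
        do_sample=True,
        pad_token_id=tok.eos_token_id,
    )

for text in tok.batch_decode(out, skip_special_tokens=True):
    print(text)
```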