this post was submitted on 13 Nov 2023

LocalLLaMA

Community to discuss Llama, the family of large language models created by Meta AI.

[–] FullOf_Bad_Ideas@alien.top 1 points 10 months ago (4 children)

It may be a stupid question, but how is it possible to generate faster than one token per full read of the weights? Assuming 4800 GB/s of memory bandwidth and a 13 GB q8 Llama 2 13B, the model can be read about 370 times per second, limiting the max generation speed to roughly 370 tokens/s. How are they going faster than that? Does batch size x generation mean that it's generating for x users at once, but each user sees only a fraction of that on their screen?
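
A quick back-of-the-envelope check of the numbers above (the bandwidth and model-size figures are taken straight from the comment):

```python
# In a memory-bandwidth-bound decode, every generated token requires
# streaming all model weights from memory once.
bandwidth_gb_per_s = 4800   # assumed HBM bandwidth from the comment, GB/s
weights_gb = 13             # q8 Llama 2 13B weights, GB (from the comment)

max_single_stream_tps = bandwidth_gb_per_s / weights_gb
print(f"~{max_single_stream_tps:.0f} tokens/s upper bound per stream")  # ~369
```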

[–] lengyue233@alien.top 1 points 10 months ago (3 children)

Yes, batch size is intended for multiple sessions (1024 parallel sessions in this case).

[–] FullOf_Bad_Ideas@alien.top 1 points 10 months ago (2 children)

Measly 12 t/s then. I mean that's great for hosting your own LLMs if you are a business - awesome cost savings, since you only need an 8-pack of those and you can serve about 20-80k concurrent users, given that most of the time they are reading the replies rather than replying with new context immediately. For people like us who don't share the GPU, it doesn't make much sense outside of rare cases.

Do you by any chance know how I could set up a kobold-like completion API that does a batch size of 4/8? I want to create a synthetic dataset based on certain provided context, locally. I was doing it with a batch size of 1 so far, but I have enough spare VRAM now that I should be able to up my batch size. Is it possible with AutoAWQ and oobabooga webui? Does it quickly run into a CPU bottleneck?
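
One way to do batched local completions (not AutoAWQ + oobabooga specifically, but vLLM, which batches submitted prompts automatically); the model name and sampling settings below are placeholders, a minimal sketch rather than a tested recipe:

```python
from vllm import LLM, SamplingParams

# Hypothetical AWQ checkpoint; swap in whatever quantized model you actually use.
llm = LLM(model="TheBloke/Llama-2-13B-AWQ", quantization="awq")
params = SamplingParams(temperature=0.7, max_tokens=512)

# vLLM schedules these prompts together; pass as many as your VRAM allows.
prompts = [f"Given the context below, write a question and answer.\n\nContext {i}: ..." for i in range(8)]
outputs = llm.generate(prompts, params)

for out in outputs:
    print(out.outputs[0].text)
```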

[–] ZenEngineer@alien.top 1 points 10 months ago

There was a paper where you'd run a faster model to come up with a sentence and then basically run a batch on the big model with each prompt being the same sentence at different lengths, each ending in a different word predicted by the small model, to basically see where the small one went wrong. That gets you a speedup if the two models are more or less aligned.

Other than that I could imagine other things, like having batches with one sentence being generated for each actor, one for descriptions, one for actions, etc. Or simply multiple options for you to choose from.
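
A toy sketch of the draft-then-verify idea described in the first paragraph (speculative decoding); both "models" below are deterministic stand-ins so the example runs without any real LLM:

```python
def draft_next(tokens):
    # Fast, small model (placeholder).
    return (sum(tokens) * 7 + 3) % 50

def target_next(tokens):
    # Slow, big model (placeholder); mostly agrees with the draft model.
    s = sum(tokens)
    return (s * 7 + 3) % 50 if s % 4 else (s + 1) % 50

def speculative_step(tokens, k=4):
    # 1) The small model drafts k tokens cheaply.
    ctx = list(tokens)
    draft = []
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2) The big model checks each prefix. Conceptually this is one batched
    #    forward pass: the same sequence at k different lengths, each ending
    #    in a token predicted by the small model.
    accepted = []
    ctx = list(tokens)
    for t in draft:
        if target_next(ctx) == t:
            accepted.append(t)                 # big model agrees: keep the drafted token
            ctx.append(t)
        else:
            accepted.append(target_next(ctx))  # first disagreement: take the big model's token
            break
    return list(tokens) + accepted

seq = [1, 2, 3]
for _ in range(5):
    seq = speculative_step(seq)
print(seq)
```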
