this post was submitted on 13 Nov 2023

LocalLLaMA

Community to discuss Llama, the family of large language models created by Meta AI.

[–] Aaaaaaaaaeeeee@alien.top 1 points 10 months ago (3 children)

(With a massive batch size*)

It would be better if they provided single-batch information for normal inference in FP8.

People look at this and think it's astonishing, but they will compare it with single-batch performance, since that's all they have seen before.
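
To make the comparison concrete, here is a rough back-of-the-envelope sketch with entirely made-up numbers (the aggregate throughput, batch size, and batch-1 speed below are hypothetical placeholders, not NVIDIA's figures):

```python
# Hypothetical numbers only, to show why an aggregate-throughput headline
# at a huge batch size can't be compared directly with batch-size-1 speed.

aggregate_tok_per_s = 12_000.0   # hypothetical headline figure at large batch
batch_size = 128                 # hypothetical number of concurrent requests
single_batch_tok_per_s = 90.0    # hypothetical batch-size-1 decode speed

# Per-user speed inside the big batch is the aggregate divided by the batch.
per_request_tok_per_s = aggregate_tok_per_s / batch_size

reply_tokens = 500
print(f"per-user speed in the big batch: {per_request_tok_per_s:.0f} tok/s")
print(f"{reply_tokens}-token reply, batched   : {reply_tokens / per_request_tok_per_s:.1f} s")
print(f"{reply_tokens}-token reply, batch of 1: {reply_tokens / single_batch_tok_per_s:.1f} s")
```

The headline number measures total throughput across many concurrent users, not how fast one reply comes back, which is why it can't be read the way single-batch numbers usually are.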

[–] CocksuckerDynamo@alien.top 1 points 10 months ago (1 children)

It would be better if they provided single-batch information for normal inference in FP8.

Better for whom? People who are just curious, or people who are actually going to consider buying H200s?

Who is buying a GPU that costs more than a new car and using it for single-batch inference?

[–] Aaaaaaaaaeeeee@alien.top 1 points 10 months ago (1 children)

It's useful for people who want to know the inference response time.

Those headline numbers wouldn't give us a 4000-token-context reply in 1/3 of a second.

[–] CocksuckerDynamo@alien.top 1 points 10 months ago

It's useful for people who want to know the inference response time.

No, it's useful for people who want to know the inference response time with batch size 1, which is not something that prospective H200 buyers care about. Are you aware that deployments in business environments for interactive use cases such as real-time chat generally use batching? Perhaps you're assuming request batching is just for offline / non-interactive use, but that isn't the case.
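
For anyone unfamiliar with how that works, here is a toy sketch of request batching in an interactive serving loop. It is not any real serving framework's API (the Request class and function names are made up for illustration); it only shows the general shape: each decode step runs the model once over every active request, so one model call serves many users at the same time.

```python
# Toy sketch of request batching for interactive serving (illustrative only,
# not any particular framework's API). Each decode step advances every
# active request by one token, so per-user latency stays interactive while
# the GPU sees a large effective batch.

from dataclasses import dataclass, field
from queue import Empty, Queue


@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    generated: list = field(default_factory=list)


def decode_step(batch):
    """Stand-in for one forward pass over the whole batch.

    A real server would run the model once and get one new token per
    active request; here each request just gets a placeholder token.
    """
    for req in batch:
        req.generated.append("<tok>")


def serve(incoming: Queue, max_batch: int = 64):
    active = []
    while True:
        # Continuously admit waiting requests into the running batch.
        while len(active) < max_batch:
            try:
                active.append(incoming.get_nowait())
            except Empty:
                break

        if not active:
            break  # toy example: stop when there is nothing left to serve

        decode_step(active)  # one model call advances every user's reply

        # Stream finished replies back and drop them from the batch.
        done = [r for r in active if len(r.generated) >= r.max_new_tokens]
        for req in done:
            print(f"{req.prompt!r} -> {len(req.generated)} tokens generated")
        active = [r for r in active if r not in done]


if __name__ == "__main__":
    q = Queue()
    for i in range(4):
        q.put(Request(prompt=f"user {i} question", max_new_tokens=3))
    serve(q)
```

Each user still gets their next token after every model step; the batch just means the GPU is producing tokens for many users in that same step.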
