this post was submitted on 13 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.

[–] drplan@alien.top 1 points 10 months ago

Perfect. Next, please: a chip that can do half the inference speed of an A100 at 15 watts.
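
To put that wish in perspective, here's a rough back-of-envelope sketch of the efficiency jump it implies. The A100's 400 W figure is the SXM TDP (an assumption on my part, not from the thread; PCIe variants run 250-300 W), and throughput is just normalized to 1.0:

```python
# Back-of-envelope: how big a perf/W gain would that chip represent?
a100_power_w = 400.0   # assumed A100 SXM TDP (PCIe variants: 250-300 W)
a100_speed = 1.0       # normalize A100 inference throughput to 1.0

target_power_w = 15.0  # the wished-for chip: 15 W...
target_speed = 0.5     # ...at half the A100's inference speed

a100_perf_per_watt = a100_speed / a100_power_w
target_perf_per_watt = target_speed / target_power_w

print(f"Required perf/W gain: {target_perf_per_watt / a100_perf_per_watt:.1f}x")
# -> Required perf/W gain: 13.3x
```

So it's asking for roughly a 13x improvement in performance per watt over a current flagship datacenter GPU.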

[–] MrTacobeans@alien.top 1 points 10 months ago

I don't think that will come from Nvidia. It's going to take in-memory compute to get anywhere near that level of efficiency, and the first samples of these SoCs are nowhere near the memory capacity needed even for small models. These types of accelerators will likely come from Intel/Arm/RISC-V/AMD before Nvidia does it.
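
To give "even small models" some numbers, here's a rough sketch of weight-only memory footprints at common precisions. The parameter counts and quantization levels are illustrative assumptions, not from the thread, and activations plus KV cache add more on top:

```python
# Rough weight-only memory footprint for small LLMs at common precisions.
# Model sizes and precisions are illustrative; KV cache and activations
# add further memory overhead beyond these figures.
models = {"3B": 3e9, "7B": 7e9, "13B": 13e9}
bytes_per_param = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

for name, params in models.items():
    row = ", ".join(
        f"{prec}: {params * b / 1e9:.1f} GB" for prec, b in bytes_per_param.items()
    )
    print(f"{name} -> {row}")
# 3B  -> fp16: 6.0 GB, int8: 3.0 GB, int4: 1.5 GB
# 7B  -> fp16: 14.0 GB, int8: 7.0 GB, int4: 3.5 GB
# 13B -> fp16: 26.0 GB, int8: 13.0 GB, int4: 6.5 GB
```

Even a 7B model at 4-bit needs a few gigabytes of on-chip or near-memory capacity, which is well beyond early compute-in-memory samples.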