this post was submitted on 13 Nov 2023

LocalLLaMA

Community to discuss Llama, the family of large language models created by Meta AI.

Aaaaaaaaaeeeee@alien.top | 1 point | 10 months ago

It's useful for people who want to know the inference response time.

This wouldn't give us a reply over a 4000-token context in 1/3 of a second.
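
For scale, a rough back-of-envelope check (my numbers, and my reading: I'm assuming the claim is about producing 4000 tokens in a third of a second; "4000 ctx" could also just mean the context length):

```python
# Back-of-envelope: throughput implied by 4000 tokens in 1/3 of a second.
tokens = 4000            # assumed number of tokens produced (illustrative)
seconds = 1 / 3          # the hypothetical response time
print(tokens / seconds)  # 12000.0 tokens/s single-stream, far beyond
                         # typical batch-size-1 decode rates
```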

CocksuckerDynamo@alien.top | 1 point | 10 months ago

> It's useful for people who want to know the inference response time.

No, it's useful for people who want to know the inference response time at batch size 1, which is not something that prospective H200 buyers care about. Are you aware that deployments in business environments for interactive use cases such as real-time chat generally use batching? Perhaps you're assuming request batching is just for offline / non-interactive use, but that isn't the case.
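
To make the batching point concrete, here's a minimal sketch of dynamic request batching in an interactive serving loop. It's illustrative only: all names (generate_batch, MAX_BATCH, MAX_WAIT_MS) are hypothetical rather than taken from any real serving framework, and a production server (e.g. one doing continuous batching) is considerably more involved.

```python
import queue
import threading
import time

MAX_BATCH = 32    # max requests per forward pass (illustrative)
MAX_WAIT_MS = 10  # how long to let a batch fill before running (illustrative)

pending: "queue.Queue[dict]" = queue.Queue()

def generate_batch(prompts):
    """Stand-in for one batched model forward pass over all prompts."""
    return [f"reply to: {p}" for p in prompts]

def serving_loop():
    while True:
        batch = [pending.get()]  # block until at least one request arrives
        deadline = time.monotonic() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                # let more interactive requests join the same forward pass
                batch.append(pending.get(timeout=remaining))
            except queue.Empty:
                break
        replies = generate_batch([r["prompt"] for r in batch])
        for req, reply in zip(batch, replies):
            req["reply"] = reply
            req["done"].set()  # wake the waiting client

# Usage: each chat client enqueues a request and waits on its event.
threading.Thread(target=serving_loop, daemon=True).start()
req = {"prompt": "hello", "done": threading.Event()}
pending.put(req)
req["done"].wait()
print(req["reply"])
```

The point of the sketch: every user still gets an individual, low-latency reply, but the GPU amortizes each forward pass over many concurrent requests, which is why batch-size-1 numbers say little about what an H200 buyer actually gets.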