this post was submitted on 13 Nov 2023

LocalLLaMA


A community to discuss Llama, the family of large language models created by Meta AI.

[–] Aaaaaaaaaeeeee@alien.top 1 points 10 months ago (1 children)

(With a massive batch size*)

It would be better if they provided single-batch figures for normal FP8 inference.

People look at this and think it's astonishing, but they'll compare it against single-batch performance, since that's all they've seen before.

[–] CocksuckerDynamo@alien.top 1 points 10 months ago (1 children)

It would be better if they provided single-batch figures for normal FP8 inference.

Better for whom? People who are just curious, or people who are actually going to consider buying H200s?

Who is buying a GPU that costs more than a new car and using it for single-batch inference?

[–] Aaaaaaaaaeeeee@alien.top 1 points 10 months ago (1 children)

It's useful for people who want to know the inference response time.

This wouldn't give us a 4,000-token-context reply in a third of a second.

[–] CocksuckerDynamo@alien.top 1 points 10 months ago

It's useful for people who want to know the inference response time.

No, it's useful for people who want to know the inference response time at batch size 1, which is not something that prospective H200 buyers care about. Are you aware that business deployments for interactive use cases such as real-time chat generally use batching? Perhaps you're assuming request batching is just for offline / non-interactive use, but that isn't the case.

[–] Longjumping-Bake-557@alien.top 1 points 10 months ago

And that's on a die just slightly bigger than the 4090's. Unless they increased the size compared to the H100?

[–] ninjasaid13@alien.top 1 points 10 months ago

H200? There's a new accelerator?

[–] The_Hardcard@alien.top 1 points 10 months ago

That's the speed of 4.8 TB/s of memory bandwidth. 5.3 TB/s is coming in a little over three weeks.

[–] FullOf_Bad_Ideas@alien.top 1 points 10 months ago (1 children)

It may be a stupid question, but how is it possible to generate faster than once per weight read? Assuming 4800 GB/s of bandwidth and a 13 GB Q8 Llama 2 13B, the model can be read about 370 times per second, limiting the max generation speed to about 370 t/s. How are they going faster than that? Does batch size × generation mean that it's generating for that many users at once, but every user sees only a fraction of it on their screen?
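
(Back-of-the-envelope version of that arithmetic, as a toy sketch; it ignores KV-cache traffic and compute and assumes decode is limited purely by streaming the weights once per forward pass:)

```python
# Rough single-stream decode ceiling: each generated token requires
# streaming the full set of weights from HBM once.
bandwidth_gb_s = 4800   # quoted H200 memory bandwidth (GB/s)
weights_gb = 13         # Llama 2 13B at ~8-bit

weight_reads_per_s = bandwidth_gb_s / weights_gb
print(f"max weight reads/s ~= {weight_reads_per_s:.0f}")  # ~369, i.e. ~370 t/s at batch 1

# The headline numbers beat this because one weight read serves every
# sequence in the batch: at batch size 1024, a single forward pass yields
# up to 1024 tokens, so the *aggregate* rate is no longer capped at
# ~370 t/s, even though each individual session still is.
```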

[–] lengyue233@alien.top 1 points 10 months ago (1 children)

Yes, batch size means multiple parallel sessions (1024 of them in this case).

[–] FullOf_Bad_Ideas@alien.top 1 points 10 months ago (2 children)

A measly ~12 t/s per session, then. I mean, that's great for hosting your own LLMs if you are a business - awesome cost savings, since you only need an 8-pack of those and you can serve roughly 20-80k concurrent users, given that most of the time they are reading the replies rather than immediately sending new context. For people like us who don't share the GPU, it doesn't make much sense outside of rare cases.

Do you by any chance know how I could set up a kobold-like completion API that does a batch size of 4 or 8? I want to create a synthetic dataset based on certain provided context, locally. I've been doing it with a batch size of 1 so far, but I now have enough spare VRAM that I should be able to increase my batch size. Is it possible with AutoAWQ and the oobabooga webui? Does it quickly run into a CPU bottleneck?
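
Not an oobabooga answer, but for the batch-of-4/8 dataset generation described above, a minimal sketch with plain Hugging Face transformers might look like this (the model name is a placeholder and an AWQ checkpoint that `from_pretrained` can load is assumed; swap in whatever you actually run):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "TheBloke/Llama-2-13B-AWQ"  # placeholder checkpoint (assumption)

tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default
tokenizer.padding_side = "left"            # left-pad so generation continues from real text

model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

context = "..."  # your provided context
prompts = [      # batch of 4, processed together in each forward pass
    f"Summarize the following:\n{context}\n",
    f"Write three questions about the following:\n{context}\n",
    f"Extract the key facts from the following:\n{context}\n",
    f"Rewrite the following as a dialogue:\n{context}\n",
]

inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text, "\n---")
```

If you'd rather keep an HTTP completion endpoint, servers built around continuous batching (vLLM, text-generation-inference) handle the batching for you, though whether they support your quant format is worth checking. At batch 4-8, decode is generally still GPU-memory-bandwidth-bound rather than CPU-bound.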

[–] JustOneAvailableName@alien.top 1 points 10 months ago

For people like us who don't share the GPU, it doesn't make much sense outside of rare cases.

Multiple agents talking to each other. Quickly parsing a knowledge base. Sampling methods like tree-of-thought, plain old beam search, or running multiple prompts at once.

I don't want to spend that amount of money, but I'd definitely want to play with one for a few months.

[–] ZenEngineer@alien.top 1 points 10 months ago

There was a paper where you'd run a faster model to come up with a sentence, then basically run a batch through the big model with each prompt being the same sentence at a different length, each ending in a different word predicted by the small model, to see where the small one went wrong. That gets you a speedup if the two models are more or less aligned.

Other than that, I could imagine other uses, like batches where one sequence is generated per actor, one for descriptions, one for actions, etc. Or simply generating multiple options for you to choose from.
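
That first idea is basically speculative decoding. A toy, greedy-acceptance-only sketch of the draft-then-verify loop (`draft_next` and `big_next` are hypothetical stand-ins for the two models; real implementations verify all drafted positions in a single forward pass rather than literally looping or building a batch of prefixes):

```python
from typing import Callable, List

def speculative_step(
    prompt: List[int],
    draft_next: Callable[[List[int]], int],  # small, fast model: greedy next token (stand-in)
    big_next: Callable[[List[int]], int],    # big, slow model: greedy next token (stand-in)
    k: int = 4,                              # how many tokens the small model drafts per round
) -> List[int]:
    """One round of draft-then-verify with greedy acceptance."""
    # 1. The small model cheaply drafts k tokens.
    draft, ctx = [], list(prompt)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. The big model checks every drafted position. Conceptually this is a
    #    batch of the same sequence at k different lengths, evaluated in one
    #    pass on the GPU; a plain loop keeps the sketch readable.
    accepted, ctx = [], list(prompt)
    for t in draft:
        big_t = big_next(ctx)
        if big_t != t:              # small model diverged here:
            accepted.append(big_t)  # keep the big model's token and stop accepting
            break
        accepted.append(t)          # models agree, so this token came almost for free
        ctx.append(t)
    return prompt + accepted
```

If the two models agree most of the time, each big-model pass yields several tokens instead of one, which is where the speedup comes from.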

[–] a_beautiful_rhind@alien.top 1 points 10 months ago

A 70B with 2048 context and a 128-token reply is about 303 t/s.

That sounds more reasonable, and that's assuming they aren't quantized. The batch size is just a theoretical batch, I think.

[–] yamosin@alien.top 1 points 10 months ago (1 children)

The H100 costs around $30,000, so I guess this one will be $70,000.

[–] FullOf_Bad_Ideas@alien.top 1 points 10 months ago

The same bench on an H100 gives about 9,000 t/s. And you can rent an H100 for $5/h on RunPod.
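
Taking those two figures at face value (and assuming the 9,000 t/s is aggregate throughput at full batch), the rental cost per token works out to roughly:

```python
tokens_per_s = 9_000   # quoted H100 aggregate throughput
usd_per_hour = 5.00    # quoted RunPod rate

tokens_per_hour = tokens_per_s * 3600                 # 32.4M tokens per hour
usd_per_million = usd_per_hour / (tokens_per_hour / 1_000_000)
print(f"~${usd_per_million:.2f} per million tokens")  # ~$0.15
```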

[–] drplan@alien.top 1 points 10 months ago (1 children)

Perfect. Next, please, a chip that can do half the inference speed of an A100 at 15 watts.

[–] MrTacobeans@alien.top 1 points 10 months ago

I don't think that will come from Nvidia. It's going to take in-memory compute to get anywhere near that level of efficiency. The first samples of those SoCs are nowhere near meeting the memory requirements of even small models. That kind of accelerator will likely come from Intel/Arm/RISC-V/AMD before Nvidia does it.

[–] lengyue233@alien.top 1 points 10 months ago

Are you going to talk with Yuki at a batch size of 1024?

[–] Useful_Hovercraft169@alien.top 1 points 10 months ago

Who gives a shit, I can't read that fast.

[–] MeMyself_And_Whateva@alien.top 1 points 10 months ago

If there's a version available for a tenth of the price, I could settle for 1,200 t/s without problems.

[–] aliencaocao@alien.top 1 points 10 months ago (1 children)

Batch size 1024, though... not for the personal use case.

[–] Herr_Drosselmeyer@alien.top 1 points 10 months ago

Obviously. There aren't many people in the world with $50k burning a hole in their pockets, and of those, even fewer are nerdy enough to want to set up their own AI server in their basement just to tinker with.

[–] jun2san@alien.top 1 points 10 months ago

"How much do you want for your old H100?" - me, to AI devs