this post was submitted on 31 Oct 2023


I am going to buy H100s. There are too many options. (alien.top)

submitted 2 years ago by OldPin8654@alien.top to c/localllama@poweruser.forum


Hi all, I need your help. I am going to buy H100s for training LLMs: currently for fine-tuning 70B models, but later we may consider pre-training larger models too. H100s look more promising than A100s because of their FP8 support, so I asked multiple vendors for quotes. And then I realized there are too many options!

1. DGX - 8x H100. Much more expensive than the other options, but they say its performance is worth it.
2. Buy PCIe H100 cards and a Supermicro machine - from 2x up to 8x. Looks cost-effective.

2.a. Some vendors offered a combination with NVLink bridges. Some say 1 link is needed per pair of cards and some say 3 links are needed per pair.

3. H100 NVL - no idea how it differs from the PCIe cards with NVLink, but it looks like it was newly introduced.
4. Some other options, like a custom build made by the vendors.

Are there any best practices I can look at to make a decision? Any advice from experts here who have already been through a similar situation? Thanks in advance 🙏

top 22 comments


[–] qrios@alien.top 1 points 2 years ago (3 children)

I realize this is totally unhelpful, but the DGX - 8x H100 costs just slightly more than the median price of a new house in the US . . .

I'm not saying this is a poor decision but . . . man that is one hell of a decision.

[–] FaustBargain@alien.top 1 points 2 years ago (1 children)

If it's a company, that could be a drop in the bucket.

[–] OldPin8654@alien.top 1 points 2 years ago

Yes! Put more money in it, the company!!!

[–] Herr_Drosselmeyer@alien.top 1 points 2 years ago (3 children)

OP isn't buying them for his personal setup, though that would be a baller move.

[–] OldPin8654@alien.top 1 points 2 years ago (1 children)

Yeah, it is not my money but still stressful

[–] Acceptable_Can5509@alien.top 1 points 2 years ago (2 children)

Wait, whose money is it? Can't you just rent as well?

[–] tvetus@alien.top 1 points 2 years ago

Can be hard to rent if all the capacity is bought out. But if it's just 1 DGX then they might be better off renting.

[–] Slimxshadyx@alien.top 1 points 2 years ago

I tried to rent from LambdaLabs yesterday, but there was no availability for any GPU.

[–] donotdrugs@alien.top 1 points 2 years ago

OP isn't buying them for his personal setup

Tbh I don't really see how this explains anything. Sure, OP doesn't go bankrupt buying it for the company but I'm 99% certain that it's still a bad financial decision.

[–] nero10578@alien.top 1 points 2 years ago

Definitely thought this was for his homelab

[–] HaywireVRV@alien.top 1 points 2 years ago

My friend’s company has a bunch of DGX idling for months. Ain’t that something.

[–] JustOneAvailableName@alien.top 1 points 2 years ago (2 children)

The H100 in the DGX is the SXM variant, not the H100 PCIe, and it is about 30% faster. When in doubt, just go DGX.

[–] OldPin8654@alien.top 1 points 2 years ago (1 children)

I will talk to my boss for more money 😆

[–] aadoop6@alien.top 1 points 2 years ago

Out of curiosity, what kind of projects are you working on that require purchasing such GPUs rather than renting on the cloud?

[–] fadenb@alien.top 1 points 2 years ago

Nooooo! With DGX you pay for the name and "service" by Nvidia. PCIe lacks the fast interconnect with NVSwitch. There is a layer in between: HGX. It's basically DGX without the branding.

You can get such systems from Supermicro and ASUS

[–] etherd0t@alien.top 1 points 2 years ago (1 children)

Where are you buying from, eBay? There are no reputable sellers atm.

[–] OldPin8654@alien.top 1 points 2 years ago

Not living in the US atm, but there are no reputable sellers here either 😂

[–] a_beautiful_rhind@alien.top 1 points 2 years ago

Supermicro makes SXM-based servers for the H100, I think. So you don't have to buy PCIe H100s or be forced to use the DGX.

[–] BreakIt-Boris@alien.top 1 points 2 years ago

You will be lucky to find a supplier who does not have a long waiting list. The demand in the enterprise sector is real, and I'm calling BS on any supplier having stock before Q2 2024.
You have done absolutely no research or even begun to look into the architecture and capability of the hardware you are discussing. If you have seriously been given the task of choosing a hardware platform for your company, then I worry for your company's future. There is a reason system architects in large organisations get paid a lot.

If you are fine-tuning, you MAY get away with an NVLinked pair of H100s when running smaller models; however, you will be massively nerfed for any 'proper' work and certainly have no chance of training your own model.
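To put rough numbers on that, here is a back-of-the-envelope sketch of the memory needed just to hold the training state for a full fine-tune. It assumes mixed-precision Adam at about 16 bytes per parameter (fp16 weights and gradients plus fp32 master weights, momentum and variance) and 80 GB per H100. Both figures are assumptions for illustration, the function names are made up, and activations and framework overhead are ignored, so treat the result as a lower bound. Parameter-efficient methods such as LoRA need far less.

```python
import math

# Assumption: full fine-tuning with mixed-precision Adam costs ~16 bytes/param:
# fp16 weights (2) + fp16 gradients (2) + fp32 master weights, momentum, variance (12).
BYTES_PER_PARAM_FULL_FT = 16
# Assumption: 80 GB of memory per H100.
H100_MEMORY_GB = 80


def training_state_gb(n_params: float, bytes_per_param: int = BYTES_PER_PARAM_FULL_FT) -> float:
    """Memory for weights, gradients and optimizer state, in GB (1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def min_gpus(n_params: float, gpu_memory_gb: float = H100_MEMORY_GB) -> int:
    """Lower bound on GPU count; ignores activations and framework overhead."""
    return math.ceil(training_state_gb(n_params) / gpu_memory_gb)


if __name__ == "__main__":
    for n_params in (7e9, 70e9):
        print(
            f"{n_params / 1e9:.0f}B params: "
            f"{training_state_gb(n_params):.0f} GB of training state, "
            f"at least {min_gpus(n_params)} GPUs"
        )
```

Under these assumptions a pair of cards only covers models in the single-digit-billion range for full fine-tuning, which matches the "smaller models" caveat.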

NVLink gets a bad name. It shouldn't. Think of it as PCIe on steroids: it connects the GPUs directly so their traffic doesn't have to use PCIe bandwidth. Even more valuable, the GPUs can communicate with each other directly without requiring CPU cycles, which saves a massive amount of latency and helps general optimisation.
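A rough way to see why the interconnect matters is to estimate how long one gradient all-reduce takes. The sketch below uses the standard bandwidth-only cost model for a ring all-reduce, where each GPU sends about 2(N-1)/N times the buffer size. The link speeds passed in at the bottom (roughly 450 GB/s per direction for NVLink on SXM and roughly 64 GB/s per direction for PCIe Gen5 x16) are approximate peak figures used as assumptions here, the function name is made up, and latency and compute overlap are ignored.

```python
def ring_allreduce_seconds(size_bytes: float, n_gpus: int, link_gb_per_s: float) -> float:
    """Bandwidth-only estimate of a ring all-reduce.

    Each GPU sends (and receives) 2 * (N - 1) / N * size_bytes over its link.
    Latency, protocol overhead and overlap with compute are ignored.
    """
    if n_gpus < 2:
        return 0.0  # nothing to synchronise with a single GPU
    traffic_bytes = 2 * (n_gpus - 1) / n_gpus * size_bytes
    return traffic_bytes / (link_gb_per_s * 1e9)


if __name__ == "__main__":
    # fp16 gradients of a 70B-parameter model: 2 bytes per parameter.
    grad_bytes = 70e9 * 2
    # Assumed per-direction peak bandwidths in GB/s; real throughput is lower.
    for name, gb_per_s in (("NVLink (SXM)", 450.0), ("PCIe Gen5 x16", 64.0)):
        seconds = ring_allreduce_seconds(grad_bytes, n_gpus=8, link_gb_per_s=gb_per_s)
        print(f"{name}: ~{seconds:.2f} s per full gradient all-reduce")
```

In this model the time scales inversely with link bandwidth, so the gap between the two options is simply the ratio of the assumed link speeds.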

The SXM options are the best bet for serious work due to their interconnectivity capabilities. The PCIe devices are essentially SXM modules on a PCB with a massive power limit applied to minimise overheating or cooling issues. For the H100: PCIe - 350 W / SXM - up to 700 W.

And that's not even touching on the use of InfiniBand or other compatible fabrics for direct compute access from connected devices (again skipping CPU cycles and communication). RDMA ftw.

So again, I'm calling BS. Usually I just smile and move on when reading another fantasist's BS story that never results in anything. However, they are becoming more and more common, especially on this sub.

If I am wrong, I apologise profusely. As stated previously, if you are honestly the member of staff who has been put in charge of a procurement decision like this, then I truly feel sorry for whoever you work for.

[–] petitponeyrose@alien.top 1 points 2 years ago

Hello,
Can you give an estimate of the prices?

[–] buildsmol@alien.top 1 points 2 years ago

If you need help with other vendors, check out https://gpumonger.com

It gathers pricing from all cloud GPU vendors.

[–] ZeeKayNJ@alien.top 1 points 2 years ago

Do tell us how you intend to train models. Specifically, which open source projects you're using, frameworks etc. Some of us are borrowing these cards with blood and sweat :)