Hi all,
I need some help from all of you.
I am going to buy H100s for training LLMs. For now this is for fine-tuning 70B models, but later we may consider pre-training larger models too.
H100s look more promising than A100s given their FP8 support, so I asked multiple vendors for quotes. Then I realized there are too many options!
1. DGX - 8x H100, much more expensive than the other options, but they say its performance is worth it.
2. Buy PCI-E H100 cards and a Supermicro machine - from 2x up to 8x, looks cost-effective.
   2.a. Some vendors offered a combination with NVLinks. Some say 1 link is needed for 2 cards and some say 3 links are needed for 2 cards.
3. H100 NVL - no idea how it differs from the PCI-E cards with NVLinks, but it looks like these are newly introduced.
4. Some other options, like a custom build made by the vendors.
Is there any BEST PRACTICE I can look at to make a decision? Any advice from experts here who have already been through a similar situation?
Thanks in advance
Yeah, it is not my money, but it is still stressful
Wait, whose money is it? Can't you just rent as well?
Can be hard to rent if all the capacity is bought out. But if it's just 1 DGX then they might be better off renting.
I tried to rent from LambdaLabs yesterday, but there was no availability for any GPU