Hi all,
I need some help from all of you.
I am going to buy H100s for training LLMs. For now this is for fine-tuning 70B models, but later we may consider pre-training larger models too.
H100s look more promising than A100s given their FP8 support, so I asked multiple vendors for quotes. Then I realized there are too many options!
1. DGX - 8x H100, much more expensive than the other options, but they say its performance is worth it.
2. Buy PCI-E H100 cards and a Supermicro machine - from 2x up to 8x, looks cost-effective.
   2.a. Some vendors offered a configuration with NVLink. Some say 1 link is needed for 2 cards and some say 3 links are needed for 2 cards.
3. H100 NVL - no idea how it differs from the PCI-E cards with NVLink, but it looks like it was introduced recently.
4. Some other options, like a custom build made by the vendors.
Are there any best practices I can look at to help make a decision? Any advice from experts here who have already been through a similar situation?
Thanks in advance!
Tbh I don't really see how this explains anything. Sure, OP doesn't go bankrupt buying it for the company but I'm 99% certain that it's still a bad financial decision.