this post was submitted on 14 Nov 2023

Machine Learning


https://higgsfield.ai
We have a massive GPU cluster and have developed our own infrastructure to manage it and train large models.

Here's how it works:

  1. Upload your dataset in the preconfigured format to HuggingFace [1].
  2. Choose your LLM (e.g. LLaMa 70B, Mistral 7B).
  3. Place your submission into the queue.
  4. Wait for it to get trained.
  5. Get your trained model on HuggingFace.
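
As a rough illustration of step 1, here is a minimal sketch of writing a dataset as JSON Lines and pushing it to a dataset repo with the `huggingface_hub` client. The `prompt`/`completion` fields, the file name, and the repo id are assumptions made for the example; the actual required format is the one described in the tutorial [1]. The upload call needs network access and a HuggingFace token, so it is left commented out.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line; return the number of records written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def upload_dataset(path, repo_id):
    """Push the file to a dataset repo on the Hub (needs network and a token)."""
    # Imported here so the local formatting step works without the package.
    from huggingface_hub import HfApi

    api = HfApi()
    api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
    api.upload_file(
        path_or_fileobj=str(path),
        path_in_repo=Path(path).name,
        repo_id=repo_id,
        repo_type="dataset",
    )


# Hypothetical records: the real schema is whatever the tutorial [1] specifies.
records = [
    {"prompt": "What is 2 + 2?", "completion": "4"},
    {"prompt": "Name a prime number.", "completion": "7"},
]
out_path = Path(tempfile.gettempdir()) / "train.jsonl"
n_written = write_jsonl(records, out_path)

# Hypothetical repo id; uncomment once you are logged in to HuggingFace.
# upload_dataset(out_path, "your-username/your-dataset")
```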

Why would we want to do this?

  1. We already have experience training big LLMs.
  2. We can achieve near-perfect infrastructure performance for training.
  3. Sometimes our GPUs simply have nothing to train.

So we thought it would be cool to keep our GPU cluster 100% utilized and give back to the open-source community (we have already built an e2e distributed training framework [2]).

This is in an early stage, so you can expect some bugs.

Any thoughts, opinions, or ideas are quite welcome!

[1]: https://github.com/higgsfield-ai/higgsfield/blob/main/tutori...

[2]: https://github.com/higgsfield-ai/higgsfield

0zyman23@alien.top 1 points 10 months ago

Wow, you guys are the best! Could you also add an estimated time for my run to start? I'm wondering whether I'll get something back in a meaningful amount of time. Still, the mere fact that things like this exist is great.
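
For what it's worth, an estimated start time like the one asked for here could be approximated with a simple FIFO simulation. This is only a sketch under assumptions: it is not how the Higgsfield queue actually works, and the function name, the per-job duration estimates, and the notion of a fixed number of training "slots" are all hypothetical. It assumes at least one slot exists.

```python
import heapq


def estimate_start_times(queue_hours, busy_until, now=0.0):
    """Return the estimated start time (in hours) of each queued job.

    queue_hours: estimated duration of each queued job, in FIFO order.
    busy_until:  for each training slot, the time at which it frees up.
    """
    # Min-heap of the times at which each slot next becomes free.
    slots = [max(t, now) for t in busy_until]
    heapq.heapify(slots)

    starts = []
    for duration in queue_hours:
        start = heapq.heappop(slots)  # the earliest free slot takes the next job
        starts.append(start)
        heapq.heappush(slots, start + duration)
    return starts


# Two slots that free up at t=1h and t=4h, and three queued jobs.
starts = estimate_start_times(queue_hours=[5.0, 2.0, 3.0], busy_until=[1.0, 4.0])
print(starts)
```

The estimate is only as good as the duration guesses: one run that takes longer than predicted pushes back every job scheduled behind it on the same slot.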