this post was submitted on 18 Nov 2023
Machine Learning
Hey r/MachineLearning!

Last year, u/rajatarya showcased how we scaled Git to handle large datasets. One piece of feedback we kept getting was that people didn't want to move their source code over to XetHub.

So we built a GitHub app & integration that lets you continue storing code in GitHub while XetHub handles the large datasets & models.

https://about.xethub.com/blog/xetdata-scale-github-repos-100-tb

We've enjoyed using it to host open source LLMs like Llama2 and Mistral with our fine-tuning code side-by-side.

The whole thing is in beta so we're eager for any feedback you have to offer :)

4_love_of_Sophia@alien.top 1 point 10 months ago

How’s it different from LFS for model files and DVC for data?

semicausal@alien.top 1 point 10 months ago

Good questions:

- DVC: no new commands to learn (we extend Git), and you don't need S3.

- Git LFS: we inject useful views of your large files into GitHub itself, in commits and PRs, which Git LFS doesn't (e.g. check this model diff: https://youtu.be/lAyymscJUvI?t=87); we scale to much larger sizes (100 terabytes); and we deduplicate better: Git LFS treats a 1-line change to a large CSV as an entirely new file, while our technique stores only the differences.
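The dedup point comes down to chunk-level, content-defined storage: instead of hashing whole files, a file is cut at boundaries derived from its own bytes, so a small edit only invalidates the chunks it touches and chunking re-synchronizes downstream. Here's a toy sketch using a Gear-style rolling hash; the table, mask, and minimum chunk size are illustrative choices for the demo, not XetHub's actual algorithm or parameters:

```python
import hashlib

# Deterministic 64-bit "gear" table: one pseudo-random value per byte value.
GEAR = [int.from_bytes(hashlib.sha256(bytes([i])).digest()[:8], "big")
        for i in range(256)]

MASK = (1 << 12) - 1  # boundary when low 12 bits hit zero => ~4 KiB avg chunks
MIN_CHUNK = 256       # avoid degenerate tiny chunks

def chunk_boundaries(data: bytes) -> list:
    """Split data at content-defined boundaries.

    The rolling value depends only on the last 64 bytes seen (older
    contributions shift out of the 64-bit accumulator), so boundary
    positions are a function of local content, not absolute offsets.
    """
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        if i + 1 - start >= MIN_CHUNK and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def chunk_hashes(data: bytes) -> set:
    return {hashlib.sha256(c).hexdigest() for c in chunk_boundaries(data)}

# A large "CSV", then the same file with a single line edited.
rows = [f"{i},value_{i}\n".encode() for i in range(20000)]
original = b"".join(rows)
rows[10000] = b"10000,edited\n"
edited = b"".join(rows)

old, new = chunk_hashes(original), chunk_hashes(edited)
shared = len(old & new) / len(new)
print(f"{len(new)} chunks, {shared:.0%} already stored and deduplicated")
```

With whole-file hashing (the Git LFS model), the edited file shares nothing with the original; with content-defined chunks, the vast majority of chunk hashes are unchanged, so only the few chunks around the edit need to be uploaded and stored.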