this post was submitted on 18 Nov 2023
Machine Learning
Hey r/MachineLearning!

Last year, u/rajatarya showcased how we scaled Git to handle large datasets. One piece of feedback we kept getting was that people didn't want to move their source code over to XetHub.

So we built a GitHub app & integration that lets you continue storing code in GitHub while XetHub handles the large datasets & models.

https://about.xethub.com/blog/xetdata-scale-github-repos-100-tb

We've enjoyed using it to host open source LLMs like Llama2 and Mistral with our fine-tuning code side-by-side.

The whole thing is in beta so we're eager for any feedback you have to offer :)

4_love_of_Sophia@alien.top 1 point 10 months ago

How’s it different from LFS for model files and DVC for data?

semicausal@alien.top 1 point 10 months ago

Good questions:

- DVC: no new commands to learn (we extend Git), and you don't need S3.

- Git LFS: we inject useful views of your large files into GitHub itself, in commits and PRs, which Git LFS doesn't (e.g. check this model diff: https://youtu.be/lAyymscJUvI?t=87); we scale to much larger sizes (100 terabytes); and we deduplicate better: Git LFS treats a 1-line change to a large CSV as an entirely new file, while our technique stores only the differences.
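The dedup point comes down to chunk-level, content-defined storage: instead of hashing whole files, a file is cut at boundaries derived from its own bytes, so a small edit only invalidates the chunks it touches and chunking re-synchronizes downstream. Here's a toy sketch using a Gear-style rolling hash; the table, mask, and minimum chunk size are illustrative choices for the demo, not XetHub's actual algorithm or parameters:

```python
import hashlib

# Deterministic 64-bit "gear" table: one pseudo-random value per byte value.
GEAR = [int.from_bytes(hashlib.sha256(bytes([i])).digest()[:8], "big")
        for i in range(256)]

MASK = (1 << 12) - 1  # boundary when low 12 bits hit zero => ~4 KiB avg chunks
MIN_CHUNK = 256       # avoid degenerate tiny chunks

def chunk_boundaries(data: bytes) -> list:
    """Split data at content-defined boundaries.

    The rolling value depends only on the last 64 bytes seen (older
    contributions shift out of the 64-bit accumulator), so boundary
    positions are a function of local content, not absolute offsets.
    """
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        if i + 1 - start >= MIN_CHUNK and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def chunk_hashes(data: bytes) -> set:
    return {hashlib.sha256(c).hexdigest() for c in chunk_boundaries(data)}

# A large "CSV", then the same file with a single line edited.
rows = [f"{i},value_{i}\n".encode() for i in range(20000)]
original = b"".join(rows)
rows[10000] = b"10000,edited\n"
edited = b"".join(rows)

old, new = chunk_hashes(original), chunk_hashes(edited)
shared = len(old & new) / len(new)
print(f"{len(new)} chunks, {shared:.0%} already stored and deduplicated")
```

With whole-file hashing (the Git LFS model), the edited file shares nothing with the original; with content-defined chunks, the vast majority of chunk hashes are unchanged, so only the few chunks around the edit need to be uploaded and stored.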