How we made Git work great for machine learning
Git is a great tool for versioning and collaborating on ML code and notebooks. However, Git is typically a poor solution for large binary files like the predictive and generative models created by ML frameworks. These binary models are just as important as the code that produced them, but they fit poorly in Git: what is an ML team to do?
In this post we’ll explain the customizations we’ve added to Git that make using Git for both code and models a great experience.
What’s the goal?
Before we get into the solution, let’s get clear about what we want. We want to track all ML model development, deployment, and versioning with Git so we can use merge requests, branching, and rollbacks. Of course this includes the binary model files, not just the code.
We want Git to work as well for Data Scientists and ML Engineers as it does for software engineers. So why doesn’t it?
Where does Git fall short?
Git was designed for text files, specifically source code files with short lines of text. ML projects have a lot of non-code assets, like pickled models and large checkpoint files, and these aren’t Git-friendly.
While Git can store binary files, they stick around forever in the repository’s local history. If you’ve ever tried to get a giant file deleted from Git’s history after Jimmy “accidentally” checked it in, you know the pain.
Large files in Git mean every future "git clone" or "git pull" can take hours and fill up your hard drive with old versions of models you’ll probably never need. Binary files in Git are also useless in code reviews, which mostly defeats the purpose of tracking them in Git in the first place.
Git-ops workflows are also challenging for ML projects. A typical ML project is likely to have historical versions of deployed models running for weeks, maybe months, to A/B test. Software engineers using Git, on the other hand, typically migrate from one version of a service to another quite rapidly. The idea that "HEAD" is the singular “current” version makes much more sense in software projects than it does in ML projects.
When it comes to ML models, standard Git gets slow and bloated, disappoints in code reviews, and isn’t designed for the kind of concurrent versioning that ML projects expect. That’s why we made Git better.
Our solution
We use Git for tracking code, models, and other ML assets. To make this work smoothly, we don’t store the binary files in Git. Instead, the binary files are automatically uploaded to S3 during "git add" and a “pointer file” is stored in the Git repository in their place.
The Git-LFS project does this, but we decided to write our own because Git-LFS is quite limited. Git-LFS doesn’t work on 2GB+ files on GitHub, which is a non-starter for ML projects since many PyTorch and LLM checkpoints are larger. It also doesn’t encrypt the files before storing them in S3, which is a compliance issue for many businesses. Nor does it compress files before uploading them, which slows down Git pushes and pulls of large files.
We took inspiration from Git-LFS and made something better for Data Scientists and ML Engineers.
Automatic encryption and S3 upload during "git add"
The first part of the improvement comes when we transform binary files on their way into, or out of, the Git repository.
We store binary files in S3 without size limitations, and with encryption and compression, using Git’s smudge and clean filters. These “filters” let us act on files as they are staged and checked out, transparently to the user. When you run "git add", our “clean” filter encrypts and uploads the file to S3 and stores a pointer file in Git. In reverse, when you "git pull", our “smudge” filter sees the pointer file and replaces it with the file downloaded and decrypted from S3. These filters only run on the commit you check out, so historical versions don’t slow your "git pull"s or take up space on your hard drive.
Here’s an example showing what happens to a pickled model file when you check it into your Git repository with "git add". The file is transformed by the “clean” filter, storing its binary form in S3 and a pointer file in Git:
This flow allows us to use Git for storing text files and S3 for storing binary files, which combines both their strengths into one polished experience.
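To make the clean and smudge round trip concrete, here is a minimal Python sketch. It is an illustration under loud assumptions, not the filters described in this post: a plain dict stands in for S3, encryption is left out entirely, and the pointer format ("example-pointer-v1") is made up for the example.

```python
import hashlib
import zlib

# Hypothetical pointer format, invented for this sketch.
POINTER_PREFIX = "example-pointer-v1 sha256:"


def clean(data: bytes, store: dict) -> bytes:
    """Store compressed content out of band and return a small text pointer."""
    digest = hashlib.sha256(data).hexdigest()
    # A real filter would also encrypt here and upload to S3.
    # Both steps are omitted; `store` is an in-memory stand-in.
    store[digest] = zlib.compress(data)
    return f"{POINTER_PREFIX}{digest}\n".encode()


def smudge(pointer: bytes, store: dict) -> bytes:
    """Turn a pointer back into the original file content."""
    text = pointer.decode("utf-8", errors="replace")
    if not text.startswith(POINTER_PREFIX):
        # Not one of our pointers: pass the content through unchanged.
        return pointer
    digest = text[len(POINTER_PREFIX):].strip()
    data = zlib.decompress(store[digest])
    if hashlib.sha256(data).hexdigest() != digest:
        raise ValueError("stored content does not match the pointer hash")
    return data
```

In a real setup, each function would sit behind a small command that reads the file from stdin and writes the result to stdout. Git would be told to run those commands through a .gitattributes line such as "*.pkl filter=modelstore" plus the "filter.modelstore.clean" and "filter.modelstore.smudge" config entries. The filter name "modelstore" is only a placeholder.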
The pointer files we store in Git aren’t simply S3 URLs or file hashes. They’re much better.
Describing binary objects for better merge requests
Uploading binary files to S3 solves part of the problem by keeping large binary files out of Git. But we still need to make Git useful for these binary files, especially during code reviews.
Of course, a pointer file containing only a hash of a binary file is about as useless as the binary file itself when it comes to code reviews. That’s why we use a variety of techniques to figure out what’s inside the binary file and add that description to the pointer file in Git. Here’s an example pointer file showing that it points to an XGBoost model:
That’s a lot more useful in a code review than the original binary file!
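As a rough idea of how a pointer file could carry a description, the sketch below guesses a file's format from its first bytes and writes a few readable fields. This is only one simple technique and not necessarily one used here; the field names and the tiny set of checks are assumptions made for the example.

```python
import hashlib


def sniff_format(data: bytes) -> str:
    """Guess a file's format from its leading bytes (a deliberately tiny set of checks)."""
    if data[:4] == b"PK\x03\x04":
        # Zip local-file header; zip containers are common for model checkpoints.
        return "zip archive"
    if len(data) >= 2 and data[0] == 0x80 and 2 <= data[1] <= 5:
        # Pickle protocols 2 and up start with the PROTO opcode and a version byte.
        return f"pickle (protocol {data[1]})"
    return "unknown binary"


def describe(data: bytes) -> str:
    """Build a hypothetical, human-readable pointer file for a binary blob."""
    lines = [
        f"format: {sniff_format(data)}",
        f"size_bytes: {len(data)}",
        f"sha256: {hashlib.sha256(data).hexdigest()}",
    ]
    return "\n".join(lines) + "\n"
```

Because the output is a few short lines of text, a change to the model shows up as an ordinary line-by-line diff in a merge request. A richer description, such as the model type or its hyperparameters, would need format-specific inspection that this sketch does not attempt.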
These two changes make Git a great experience for ML models and other binary files, so we kept going with other ML resources, like our model registry.
Making a branch-aware model registry
Most model registries are stored in a SQL database. It’s a natural place for models and their metadata if you think of the registry as external to the model development workflow. But is the registry really independent from the Git repository storing the models and the code that makes them?
What if your model registry was backed by Git, instead of SQL? Then you could have branch-aware registries, and test changes to your model registry in your staging environment. A Git-backed registry also makes it easy to travel back in time and see which models were running last quarter. Protected branches mean Jimmy cannot break your registry by accident, again. Best of all, you could review model registry changes in the same merge request that contained changes to the inference code using the models!
Since inference code and models can be successfully stored and versioned in Git, that’s where we chose to build our model registry as well. We believe that the registry is as much a part of Git as the models and code that made them.
Our model registries store content like the above pointer file. This way, when working on a branch you can add/edit/remove models in the registry and present the entire change at once in a merge request.
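To illustrate what a branch-aware registry change might look like in review, here is a small sketch that compares two registry states. It assumes, purely for the example, that a registry can be read as a mapping from model name to pointer hash; the function name and output wording are hypothetical.

```python
def diff_registry(base: dict, branch: dict) -> list:
    """Summarize how a branch's registry differs from the base branch's registry."""
    changes = []
    for name in sorted(base.keys() | branch.keys()):
        if name not in base:
            changes.append(f"added {name} -> {branch[name]}")
        elif name not in branch:
            changes.append(f"removed {name}")
        elif base[name] != branch[name]:
            changes.append(f"changed {name}: {base[name]} -> {branch[name]}")
    return changes


if __name__ == "__main__":
    # Made-up registry contents for two branches.
    main = {"churn": "sha256:aaa", "fraud": "sha256:bbb"}
    feature = {"churn": "sha256:ccc", "pricing": "sha256:ddd"}
    print("\n".join(diff_registry(main, feature)))
```

With the registry stored as text files in the repository, Git's own diff already provides this view. The sketch just shows why adds, edits, and removals can be reviewed together as one change.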
Once you’ve tried a complete Git workflow, from model development to registry changes in shared merge requests, it’ll be hard to go back.
Going operational
With our Git repository ready for ML development, the last step is using it for deployment.
Our Git repositories trigger events whenever files change. Just like a CI/CD pipeline, we listen for those events and create new versions of deployments, update model registries, or change DNS records to alias one version to another.
And instead of treating the latest commit in Git as the only “current” version, we treat every version as idempotent. For example, any change to an ML model’s endpoint source code creates a new version of that endpoint, with its own URL. This flow makes keeping many concurrent versions of a deployed model around for A/B tests a natural fit.
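One way to picture this versioning model is sketched below: each deploy appends a version with its own URL, and named aliases can point at any existing version. The class, the URL scheme, and the example.com host are all assumptions for illustration, not the actual deployment system.

```python
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    name: str
    versions: list = field(default_factory=list)  # commit ids, oldest first
    aliases: dict = field(default_factory=dict)   # alias name -> version number

    def deploy(self, commit: str) -> str:
        """Record a new version for this commit and return that version's URL."""
        self.versions.append(commit)
        return self.url(len(self.versions))

    def url(self, version: int) -> str:
        # Hypothetical URL scheme; example.com is a placeholder host.
        return f"https://example.com/{self.name}/v{version}"

    def point(self, alias: str, version: int) -> None:
        """Aim an alias (for example "latest") at an existing version."""
        if not 1 <= version <= len(self.versions):
            raise ValueError(f"no such version: {version}")
        self.aliases[alias] = version

    def resolve(self, alias: str) -> str:
        """Return the URL of the version an alias currently points to."""
        return self.url(self.aliases[alias])
```

Earlier versions are never overwritten in this sketch, so an A/B test can keep routing a share of traffic to v1 while v2 takes the rest. Rolling back only requires re-pointing an alias.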
By storing everything in Git we can make complex infrastructure changes safe, reviewable, and undoable.
Git is great for ML
With the changes outlined above, Git can be a great tool for Data Scientists and ML Engineers. We’ve built all of this (and more) into Modelbit.
I've used DVC (Data Version Control) in the past to do most of this, and I bet there are other great tools out there as well. I found it extremely useful for projects where data and models are constantly changing between iterations. I don't think you can do the whole informative pointer thing that your tool has though, which is pretty cool. In my workflows with DVC, those parameters would usually be strictly defined in a params file or the dvc.yaml file, which would be in plain text in the Git repo. Cool project
Absolutely! In a past life working on video games, we used Perforce because it had much better support for large art assets. In this life, though, we find that ML teams are often part of larger tech engineering teams that have already decided on Git, so that's a constraint we took to heart. (also, thank you!)