Hi, one of the maintainers of Git LFS here. I'm also a Git contributor, and I don't think it's a secret or in any way shocking to most of the Git contributors that the intended purpose of the smudge/clean filter functionality in Git was to perform simpler modifications on source code.
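To make that concrete, here's a toy example of the kind of "simpler modification" a filter driver was designed for; the script name, the keyword it rewrites, and the config lines in the comments are all hypothetical, not anything Git ships:

```python
#!/usr/bin/env python3
# keyword_filter.py -- hypothetical example of the "simpler modifications"
# clean/smudge filters were designed for: collapsing/expanding a keyword.
#
# Wiring (assumed setup, not part of this script):
#   .gitattributes:  *.txt filter=keyword
#   git config filter.keyword.clean  "python3 keyword_filter.py clean"
#   git config filter.keyword.smudge "python3 keyword_filter.py smudge"
import sys

def main() -> None:
    mode = sys.argv[1]  # "clean" (worktree -> repo) or "smudge" (repo -> worktree)
    data = sys.stdin.buffer.read()
    if mode == "clean":
        # Store a canonical placeholder so the committed blob stays stable.
        data = data.replace(b"$Date: expanded $", b"$Date$")
    elif mode == "smudge":
        # Expand the placeholder when the file is written to the working tree.
        data = data.replace(b"$Date$", b"$Date: expanded $")
    sys.stdout.buffer.write(data)

if __name__ == "__main__":
    main()
```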
However, one of the benefits of Git is that it is enormously flexible, and it can be and is successfully leveraged to provide large-file functionality using this mechanism. Before I maintained Git LFS or knew how it worked, I hypothesized that this would be the ideal mechanism for handling large files, so it shouldn't be surprising that the original developers decided to use it. We just need to be cognizant that, as with any design, there are going to be some limitations, which is what I was mentioning in that thread.
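Conceptually, LFS leverages that same hook by having the clean filter stash the real content in a separate object store and hand Git a small pointer file, with the smudge filter doing the reverse. Here's a rough sketch of the clean side, with the store path and sharding simplified; this is illustrative, not the actual git-lfs code:

```python
#!/usr/bin/env python3
# Rough sketch of what an LFS-style "clean" filter does: stash the real
# bytes under their SHA-256 and emit a tiny pointer file for Git to commit.
# Paths and layout are simplified; this is not the real git-lfs client.
import hashlib
import sys
from pathlib import Path

STORE = Path(".git/lfs/objects")  # simplified; git-lfs shards by hash prefix

def clean(data: bytes) -> bytes:
    oid = hashlib.sha256(data).hexdigest()
    obj = STORE / oid[:2] / oid[2:4] / oid
    obj.parent.mkdir(parents=True, exist_ok=True)
    obj.write_bytes(data)  # the large content stays out of Git's object database
    # This small pointer is what actually gets committed in place of the file.
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(data)}\n"
    ).encode()

if __name__ == "__main__":
    sys.stdout.buffer.write(clean(sys.stdin.buffer.read()))
```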
As a side note, it's intentional that we don't use hard links or symlinks into the LFS storage, because that makes it extremely easy to destroy or corrupt data by modifying the working tree, so Git's behavior here is actually helpful. There is copy-on-write functionality in Git LFS that can be used, if your file system supports it, to make the on-disk size of the repository a little less painful with large files.
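To make the hard-link concern concrete, here's a small standalone sketch (file names are made up) of why linking the working tree into the object store would be dangerous: both names share one inode, so an in-place edit to the checkout silently corrupts the stored object too.

```python
#!/usr/bin/env python3
# Why hard-linking the worktree file to the LFS object store is risky:
# both names share one inode, so editing the checkout corrupts the store.
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    store_copy = os.path.join(tmp, "lfs-object")    # stands in for .git/lfs/objects/...
    worktree_copy = os.path.join(tmp, "asset.bin")  # stands in for the checked-out file

    with open(store_copy, "wb") as f:
        f.write(b"original large-file contents")
    os.link(store_copy, worktree_copy)              # hard link: same inode, two names

    # A tool that edits the file in place (instead of replacing it) writes
    # through the shared inode...
    with open(worktree_copy, "r+b") as f:
        f.write(b"oops, edited")

    # ...and the "pristine" object in the store is now corrupted as well.
    with open(store_copy, "rb") as f:
        print(f.read())  # b'oops, editedge-file contents'
```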
Would love to hear some insights from game developers or others dealing with many binary media files. Is this what Git LFS is used for, or are there other popular tools in the scene?
I'm part of a compiler research group and we use Git LFS to maintain the history of the binary releases of our compiler (sac2c). Our perspective on this is that we can more closely relate the binaries to release notes, bug reports, etc. by leveraging systems like GitLab. To give you an idea, here's our repo: https://gitlab.science.ru.nl/sac-group/sac-packages/.
Can you suggest where those artifacts should be stored?
I often have projects that require third party libraries, and perhaps I change or upgrade the version of the library I use during the history of my project. I would like to have everything versioned together so if I check out a certain commit, I know that everything will work.
The downside of this is that my Git repo will get bloated over time if I keep changing large DLLs. I've sometimes made submodules to mitigate this problem. Git LFS also seemed like a good solution, but your comment makes me feel like I'm doing something bad here...
You could use a private feed on a package manager that supports arbitrary ZIP packages. You can (ab)use most package managers this way (npm, NuGet, Maven, etc.). Azure DevOps Artifacts also has a very simple, dumb "Universal Package" manager that just attaches a version number to a ZIP file and really doesn't care what is inside it.
Then you just use the usual sorts of package management tools (including locks) to keep everything versioned together.
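As a rough sketch of that workflow (the names, feed layout, and lock-file format here are hypothetical, not any particular package manager's), you'd publish the binaries as a versioned archive and commit only a pinned reference:

```python
#!/usr/bin/env python3
# Minimal sketch of the "versioned ZIP + lock file" idea: publish third-party
# binaries as versioned archives and pin them by hash. Names and layout are
# hypothetical, not any real package manager's format.
import hashlib
import json
import zipfile
from pathlib import Path

def publish(name: str, version: str, files: list[str], feed: Path) -> dict:
    feed.mkdir(parents=True, exist_ok=True)
    archive = feed / f"{name}-{version}.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in files:
            zf.write(f)
    digest = hashlib.sha256(archive.read_bytes()).hexdigest()
    # The lock entry is what you commit, instead of the binaries themselves.
    return {"name": name, "version": version, "sha256": digest}

if __name__ == "__main__":
    Path("vendor").mkdir(exist_ok=True)
    Path("vendor/foo.dll").write_bytes(b"\x00" * 1024)  # stand-in binary for the demo
    entry = publish("thirdparty-libs", "1.4.2", ["vendor/foo.dll"], Path("feed"))
    Path("deps.lock.json").write_text(json.dumps([entry], indent=2))
```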
> Can you suggest where those artifacts should be stored?
It depends on where you get them. If you're using a compiled object that you get from a third party and you don't have the source or don't build it yourself, then I'm not sure what to tell you. You're at the mercy of your vendor anyway, so my other arguments against it won't make a difference.
IMO it might be a good idea to store binary assets in Git from a "history management" point of view.
However, it's a mess when you have to deal with a lot of huge files. Not so much because of bad habits, but because it's hard to manage day-to-day and you will run into a lot of issues.
One of the issues I ran into recently is the time required to switch between branches when LFS-tracked files change. It comes down to how checkout and git-lfs interact (see the GitHub issue): it actually copies the files, and that operation takes time. Instead it could use a hard link or a symbolic link.
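As a rough illustration of why the copy hurts (a hypothetical benchmark script, not how git-lfs measures anything), a full copy scales with the size of the asset while a link is effectively instant:

```python
#!/usr/bin/env python3
# Rough illustration of why copying LFS objects into the worktree on
# checkout is slow compared to linking: a full copy scales with file size,
# a hard link does not. (Hypothetical benchmark, not git-lfs itself.)
import os
import shutil
import tempfile
import time

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "big-asset")
    with open(src, "wb") as f:
        f.write(os.urandom(256 * 1024 * 1024))  # 256 MiB stand-in for a large asset

    t0 = time.perf_counter()
    shutil.copy(src, os.path.join(tmp, "copy"))  # what a plain checkout effectively does
    t1 = time.perf_counter()
    os.link(src, os.path.join(tmp, "hardlink"))  # near-instant regardless of size
    t2 = time.perf_counter()

    print(f"copy:      {t1 - t0:.3f}s")
    print(f"hard link: {t2 - t1:.3f}s")
```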
> IMO it might be a good idea to store binary assets in Git from a "history management" point of view.
You can't diff them, just see that they changed. If you have versions in the filename of the object, that helps, but it isn't great from an auditing perspective. People change filenames, and if someone alters or tampers with the file it's hard to tell. You might argue that this is also possible if you fetch the file from somewhere or install a package and link against it, but there you have a better audit trail, and packages can (and should) also be signed.
Pulling resources in from other places still has to solve the same problems as git-lfs. What specifically makes resources stored elsewhere less of a bad-habit enabler?
Either you need (large, not suitable for inclusion or submodules) resources from outside the repo or you don't. If you don't, don't do it. If you do, git-lfs is to me a valid choice; the specific best choice depends on workflow details, and to me git-lfs doesn't have any obvious potential to be misused more than the alternatives.