A development team in your organization is using a monorepo and it's became quite big, including hundred thousands of files. They say running many git operations is taking a lot of time to run (like git status for example). Why does that happen and what can you do in order to help them?

Difficulty: unrated

Source: bregman-arie/devops-exercises by Arie Bregman

Answer

Many Git operations are related to filesystem state. git status for example will run diffs to compare HEAD commit to index and another diff to compare index to working directory. As part of these diffs, it would need to run quite a lot of lstat() system calls. When running on hundred thousands of files, it can take seconds if not minutes.

One thing to do about it, would be to use the built-in fsmonitor (filesystem monitor) of Git. With fsmonitor (which integrated with Watchman), Git spawn a daemon that will watch for any changes continuously in the working directory of your repository and will cache them . This way, when you run git status instead of scanning the working directory, you are using a cached state of your index.

Next, you can try to enable feature.manyFile with git config feature.manyFiles true. This does two things:

Sets index.version = 4 which enables path-prefix compression in the index
Sets core.untrackedCache=true which by default is set to keep. The untracked cache is quite important concept. What it does is to record the mtime of all the files and directories in the working directory. This way, when time comes to iterate over all the files and directories, it can skip those whom mtime wasn't updated.

Before enabling it, you might want to run git update-index --test-untracked-cache to test it out and make sure mtime operational on your system.

Git also has the built-in git-maintainence command which optimizes Git repository so it's faster to run commands like git add or git fatch and also, the git repository takes less disk space. It's recommended to run this command periodically (e.g. each day).

In addition, track only what is used/modified by developers - some repositories may include generated files that are required for the project to run properly (or support certain accessibility options), but not actually being modified by any way by the developers. In that case, tracking them is futile. In order to avoid populating those file in the working directory, one can use the sparse checkout feature of Git.

Finally, with certain build systems, you can know which files are being used/relevant exactly based on the component of the project that the developer is focusing on. This, together with the sparse checkout can lead to a situation where only a small subset of the files are being populated in the working directory. Making commands like git add, git status, etc. really quick