Should we split git projects across multiple repositories?

January 28, 2019

Recommendations

  • Keep your project source code in a single repository for simplicity as long as you reasonably can while keeping your developers working efficiently.
  • Invest in your build to keep your median automated build and test time as fast as possible – 15 minutes is good, 1 minute is better.
  • Keep large binaries out of your repository, and when you consume third party code, reference the upstream repositories instead of copying them into your own.

Background

Git has won the source code management war. The Google search query trend is clear:

Almost every developer applying for a job lists git on their CV. All the metrics show git dominating. Even Microsoft and Perforce, who had popular source code management tools of their own, now see git as a big part of their strategy. For me, the big open question in source code management has moved on to how to use git for larger projects with multiple related pieces that could be built and deployed separately. When you start off with a greenfield project of just a few small, closely related text files, it is easy to put a repository up on GitHub, and provided you keep large files out of your repository it will probably work out fine for you.

At some point you may end up with two or more discrete parts of your project. For example, you may have produced a web service with a site and an API, then decide to work on an iOS app for mobile users. Or, you may have made a library you want to share between several programs. Possibly you have used some open source code and made changes to it which have not yet been merged into the original project. You can choose either to put each component in its own repository, or to keep them all in a single repository. I have organised source code management on projects that took both approaches.

If you choose to use many small related repositories

Let’s suppose you decide to work with multiple repositories and think through the implications. We can visualise the project as a collection of standalone trees:

Changes can be made to individual repositories. When you deploy and test a system you will normally want to recall which version of each repository was used. Here are some options:

Versioning using a build timestamp

You could record the time when you checked out the source code of each repository and use that to work out what went into each build. That is problematic because:

  1. If you allow merge commits in your branch history then each merge makes it ambiguous what state the branch was in at times before the merge.
  2. Git operation times are determined by each developer's machine, which may not have an accurately synchronised clock.
  3. Continuous integration systems often do not check out each repository at exactly the same time.

Versioning using a manifest file

One option is to store the version of each repository you need in a file, either as source code in one repository or in each build artifact. One of the larger multi-repository projects is Android, for which Google developed Repo, a command line tool that works with manifest files to handle operations across multiple repositories, and Gerrit, a web interface for code review across multiple repositories. Note this means that your developers are effectively using repo/gerrit as their source code management system, and those tools then use git. With this approach the tool will generally check out the entire forest of git repositories, and on some projects that can be prohibitively expensive.
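In its simplest form a manifest is just a mapping from repository name to commit ID, captured at build time. Here is a minimal sketch of producing one, assuming a top-level directory containing one git clone per repository (the directory layout, function name and output filename are illustrative, not the Repo tool's format):

    import json
    import subprocess
    from pathlib import Path

    def snapshot_manifest(checkout_root: str, output_file: str = "manifest.json") -> None:
        """Record the HEAD commit of every git clone found under checkout_root."""
        manifest = {}
        for repo in sorted(Path(checkout_root).iterdir()):
            if (repo / ".git").exists():
                commit = subprocess.check_output(
                    ["git", "-C", str(repo), "rev-parse", "HEAD"], text=True
                ).strip()
                manifest[repo.name] = commit
        Path(output_file).write_text(json.dumps(manifest, indent=2, sort_keys=True))

Storing that file alongside each build artifact gives you a record of exactly which commits went into the build.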

If you do this, be careful if you ever have to rewrite the history of your repositories. Since the commit IDs in the manifest files are not considered by git when rewriting, you may find your git server has garbage collected your old commits. Finding a good solution to that problem leads us to think about how to store the manifest information in a manner that is better integrated with git.

Versioning by having your continuous integration builds tag all your repositories

Your automated build system could trigger whenever any repository changes and then tag the commits it used across every repository. You can then see the changes in a repository between two builds just by diffing the two build tags. In contrast to approaches using manifest files, developers can use git directly and download just the repositories they need.
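A minimal sketch of what such a build step could do, assuming the build system has already checked out each repository under one parent directory (the build-NNN tag naming and layout are illustrative):

    import subprocess
    from pathlib import Path

    def tag_all_repositories(checkout_root: str, build_number: int) -> None:
        """Tag the commit used for this build in every repository, then push the tag."""
        tag = f"build-{build_number}"
        for repo in Path(checkout_root).iterdir():
            if (repo / ".git").exists():
                subprocess.check_call(["git", "-C", str(repo), "tag", tag])
                subprocess.check_call(["git", "-C", str(repo), "push", "origin", tag])

Comparing two builds in one repository is then a matter of, for example, git log build-1041..build-1042.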

We used this strategy on OpenXT, a project which had about 150 repositories at one point. A busy developer would often push five new versions of one repository a day, sometimes we had 25 developers working on a given day, and we triggered a build whenever commits landed. This meant it was quite common for even a rarely changed repository to gain 125 tags per working day. After three years we ended up with 30,000 tags on the head of master on some repositories. The core of git copes admirably with this, but git GUIs often become unusable or crash.

Git submodule

Git itself has integrated support for submodules, which allow controlled recursive inclusion of other repositories in a repository, alongside content in the top level repository. In the words of the submodules section of the git manual:

“It can be a bit confusing, but it’s really not very hard.”

“Using submodules isn’t without hiccups, however. For instance switching branches with submodules in them can also be tricky.”

GitHub published a nice tutorial. If you are the git expert on your team and are considering submodules, then do consider whether the extra complexity of git submodules is too much for those on your project who are least comfortable with git. Frankly, even though I have worked deeply in source code management for many years, written my own tools (including complete distributed version control systems) and trained dozens of people on git, I still find occasional situations which require careful thought and re-reading of the manual. Therefore my personal preference is to avoid introducing an avoidable source of complexity. Part of the attraction of git is that at some levels it is simple. Moreover, the complexities of submodules lie in areas that everyone in the team has to deal with, such as the initial set up of a developer machine and branch switching.
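To give a flavour of that extra setup cost, here is a minimal sketch of the additional step a fresh checkout needs when submodules are involved (the repository URL is hypothetical):

    import subprocess

    REPO_URL = "https://github.com/yourorg/yourproject.git"  # hypothetical example

    def fresh_checkout(dest: str) -> None:
        """Clone the top-level repository and then populate its submodules."""
        subprocess.check_call(["git", "clone", REPO_URL, dest])
        # Without this extra step the submodule directories are left empty,
        # which is a common surprise for people new to submodules.
        subprocess.check_call(
            ["git", "-C", dest, "submodule", "update", "--init", "--recursive"]
        )

Passing --recurse-submodules to git clone achieves the same thing in one command, but every developer and every CI job has to remember one or the other.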

Making consistent changes across several repositories

Often on a project that spans multiple repositories you find yourself needing to make a change atomically across several repositories at once if you want any resulting build to work. Sometimes you can order your changes safely so that the first can be committed without the others, but often that is impossible or a lot of extra work. If you require people to review changes, for instance using GitHub pull requests, which many regard as best practice, the logistics of getting all the breaking changes in atomically become complex, to the point where without tool support people often give up expecting each build to be stable. That is a bad precedent; you now have the burden of communicating which builds are worth testing, and it becomes harder to spot regressions.

Early on in OpenXT we did code review informally, both before and after commit to master, because of this difficulty. The continuous integration system used Buildbot and had some custom logic to wait for a ten minute window with no commits before starting a build, so that people had a chance to push changes to all the relevant repositories. That mostly worked, at the cost of some delay in getting build results, which did not have such a big impact since the builds took six hours anyway.
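As an illustration (not the actual Buildbot configuration), the quiet-window check amounts to finding the newest commit across all repositories and only building once ten minutes have passed since it:

    import subprocess
    import time
    from pathlib import Path

    QUIET_PERIOD_SECONDS = 10 * 60

    def newest_commit_time(checkout_root: str) -> int:
        """Return the unix timestamp of the most recent commit across all clones."""
        newest = 0
        for repo in Path(checkout_root).iterdir():
            if (repo / ".git").exists():
                out = subprocess.check_output(
                    ["git", "-C", str(repo), "log", "-1", "--format=%ct"], text=True
                )
                newest = max(newest, int(out.strip()))
        return newest

    def wait_for_quiet_window(checkout_root: str) -> None:
        """Block until no repository has received a commit for ten minutes."""
        while time.time() - newest_commit_time(checkout_root) < QUIET_PERIOD_SECONDS:
            time.sleep(60)

A real implementation would also fetch from each remote before checking, but the idea is the same.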

If you stay with a single repository as you grow

None of the approaches above to versioning across a forest of repositories is ideal. As you grow you can instead choose to keep all your project work in a single repository rather than splitting it. Let's consider the risks you may face with a single tree repository.

Risk: Your continuous integration build gets so slow it impacts your team morale and productivity

A straightforward configuration of continuous integration tools such as Jenkins or TeamCity typically builds the entire contents of your repository every time something changes. As your project grows this gets worse for two reasons that multiply together:

  • More developer effort produces more code which takes longer to build
  • More developers produce more changes per hour which causes more builds to be triggered.

So the cost of straightforward continuous integration scales as O(n^2) with the number of active developers. It is very valuable to keep the build time fast enough that people do not need to start a new piece of work while waiting for a central build to verify something they have just changed. Once your build time gets over 15 minutes people will typically start something else and, if further attention is required, will tend to wait until a convenient moment to switch back. Everyone can afford to wait 5 minutes, though if you can get your build that fast and people are frequently waiting for a computer to do something then small time savings are still valuable. So, plan to get enough of the best build machines you can afford, and remain vigilant about the tendency of build times to creep upwards.

Most changes to a large project do not inherently require rebuilding everything, so it is possible to significantly reduce the build time of large repositories by doing some of the following:

  1. Keep the output files from previous builds of a branch so that only changed parts are rebuilt. The correctness of your continuous integration builds then relies on your build system rebuilding correctly after each change.
  2. Use a build system which can cache and reuse individual output files, such as Bazel (if you think it is ready for your purposes).
  3. Arrange to reuse build artifacts yourself; I described some techniques for doing this in detail last year, and a sketch of the core idea follows this list.
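The core of artifact reuse is keying a cache on a hash of the build inputs; this is a minimal sketch under the assumption that a build_fn callable produces a single output file (the names and layout are illustrative, not the tooling from that earlier article):

    import hashlib
    import shutil
    from pathlib import Path

    def cached_build(source_dir: str, cache_dir: str, build_fn) -> Path:
        """Reuse a previously built artifact when the build inputs are unchanged."""
        # The cache key is a hash over the contents of every source file, so any
        # change to the inputs produces a new key and forces a real build.
        digest = hashlib.sha256()
        for path in sorted(Path(source_dir).rglob("*")):
            if path.is_file():
                digest.update(path.read_bytes())
        Path(cache_dir).mkdir(parents=True, exist_ok=True)
        cached = Path(cache_dir) / digest.hexdigest()

        if cached.exists():
            return cached                    # cache hit: skip the build entirely
        artifact = build_fn(source_dir)      # cache miss: run the real build
        shutil.copy(artifact, cached)        # store the output under the input hash
        return cached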

Risk: You include third party code or your own binary assets and make your project cumbersome

If you produce binary assets, such as video or high resolution graphics, then the overhead of dealing with them can become problematic. Developers will download them automatically when they start work on your project, and each developer will need disk space to store them. You may want to use cloud storage such as S3, a file server running software such as Artifactory, or the Git LFS extension if the git hosting your team uses supports it. I would recommend taking steps early in your project to keep large files out of git; I wrote previously about how we did that after hitting a 2GB repository size limit on our git hosting service.
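For example, if you go the Git LFS route, the setup amounts to telling LFS which file patterns to intercept so that matching files are stored as small pointers rather than blobs in history. A minimal sketch, with example file patterns:

    import subprocess

    def configure_lfs(repo_path: str, patterns=("*.mp4", "*.psd", "*.png")) -> None:
        """Configure Git LFS so files matching the patterns are stored via LFS."""
        subprocess.check_call(["git", "lfs", "install"], cwd=repo_path)
        for pattern in patterns:
            # 'git lfs track' records the pattern in .gitattributes
            subprocess.check_call(["git", "lfs", "track", pattern], cwd=repo_path)
        # Commit .gitattributes so the whole team shares the same LFS configuration.
        subprocess.check_call(["git", "add", ".gitattributes"], cwd=repo_path)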

It is wise to keep third party source code out of your repository. You should make sure you explicitly select the version of third party code rather than just taking the latest, so that you are in control of when you take third party updates. If you want to make changes to third party source you use, then either:

  1. Fork the upstream repository and update your project build to use your forked clone at specific commit IDs (a sketch of this pinning follows this list). This way you end up with a two step process: make changes to your fork, commit them, then update your project repository to use the new version.
  2. Keep a queue of patches against a specific upstream version in your repository. This way your own changes always end up going into your repository, but those maintaining the patch queue will have to learn how to work with it. This will often mean looking at diffs of diffs, which everyone finds hard to understand. Tools for patch queues include quilt, topgit and stgit. Once the developers dealing with third party source changes get the hang of it, though, this approach can be productive, and other developers on your project can usually ignore the complexities of patch queues.
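To illustrate the first option, pinning a forked dependency means your build fetches the fork and checks out exactly the commit the project was tested with; a minimal sketch, with a hypothetical fork URL and commit ID:

    import subprocess

    # Hypothetical pinned third party dependency: your fork plus an exact commit ID.
    FORK_URL = "https://github.com/yourorg/somelib.git"
    PINNED_COMMIT = "6b3c2f4e9d0a1b2c3d4e5f60718293a4b5c6d7e8"

    def fetch_pinned_dependency(dest: str) -> None:
        """Clone the fork and check out the commit recorded by the project."""
        subprocess.check_call(["git", "clone", FORK_URL, dest])
        subprocess.check_call(["git", "-C", dest, "checkout", PINNED_COMMIT])

Updating the dependency is then an explicit, reviewable change to PINNED_COMMIT rather than an accident of whenever someone last cloned.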

Canards

There has been extensive discussion of this topic and there are a few arguments that keep coming up. The recent blog post 'Monorepos: please don't' is worth reading for balance.

“Git is only designed for small repositories so you have to split large projects”

Git was developed by the Linux kernel developers, and the kernel itself grew to around 25 million lines of code by the end of 2017. Microsoft reported in 2017 that they use git for Windows, a repository of about 300GB used by 4,000 engineers; to achieve that, Microsoft developed a virtual filesystem. However, you are unlikely to hit that scale until you have thousands of developers working for more than a decade. That level of success is unusual and will involve many challenges, making occasional retooling of the way your team handles source code a relatively minor concern. I would be careful about disregarding simple and productive ways of working because of risks that you may need to make a change in the future.

“Your project is tiny compared to Google so you don’t have the same requirements and you should make different choices”

A fascinating single case study about the way one company handles 87TB of data does not in itself constitute irrefutable proof of the wide applicability of that one solution; nor does the impact of that paper mean that everyone who suggests a single repository for a project is seduced by the glamour of Google and developer operations at that scale. For me, the discussion of using multiple or single repositories is pragmatic; the details of specific source code management tools are therefore important, and thus the case study (which describes a team that doesn't use git, but instead a proprietary system with a virtual filesystem) is only relevant to the extent that it illuminates problems that can occur at smaller scales.

“Monorepos encourage tight coupling”

Decoupling a system into modules requires careful design. You can indeed twist your developers' arms into more modularity by making cross-cutting changes difficult: split your project into many repositories and then provide no tooling for synchronised changes. If you choose to do that, be careful that you don't simply create a lot of make-work when people need to change many modules at once, at least in the early stages of a project. The trouble with procedural pain is that it can be hard to quantify, since the cost gets spread across many pieces of work.

“Everything is decomposed into lots of apps and microservices with their own release cycles so you have to deal with backwards compatibility.”

Decomposition and release cycles do not have to be connected. Sometimes you may have no choice; for instance, if you are producing iOS and Android apps and a server to go with them, then you will be at the mercy of the app store approval processes. That means you have to test the compatibility of a new service release against the application versions you know could be found in the wild; this is a necessary and manageable quality assurance process. Splitting up server software into many services running separately is often a good idea. It helps you isolate faults, filter your logs, understand memory management better, scale out to use a range of servers and so on, and if you don't put this into your architecture early you may find it difficult to do later. But the limited loss of control imposed by app stores does not mean you have to accept that each microservice must interoperate with multiple versions of all other microservices; doing so exponentially increases the set of configurations you would have to test (ten services each supporting three versions already gives 3^10, nearly 60,000, combinations).

Even if you do end up with many components with separate lifecycles, it still makes sense to review cross-cutting changes together, so that the reviewer has visibility of every piece of work needed for a change.

Using multiple repositories gives you exactly one level of hierarchy for ownership, licensing and abstractions, and giving that one level prominence through the tools causes problems. Ownership is often fluid, changing much more frequently than people are willing to go through the heavy lifting of splitting up repositories, so it is often wise to keep ownership orthogonal to project structure.

“At scale, there is simply no way to rebuild the entirety of a codebase and run all automated tests when each change is submitted (or, more importantly and more often, in CI when a change is proposed).”

At sufficiently spectacular scale it is hard to do anything. Before you reach intractable scale, though, you will need a build and test system that is sophisticated enough not to simply repeat everything on every single change. If you can cache appropriately then the required build capacity should grow linearly with your developer team size. Integration test time may indeed scale worse than linearly: despite your best efforts to simplify, the cost of the integration tests tends to grow over time with the number of developers, and they have to run each time a developer submits a change, but even with a single integration branch that is still only O(n^2). On larger projects such system testing is often deferred because of its cost and run much less frequently than on every change.
