Everything you ever wanted to know about git submodules and more

By: on January 31, 2015

I regularly hear complaints that git submodules are difficult to work with. If you search for ‘git submodules’, then (depending on your filter bubble) you’ll probably get several blog articles warning you not to use them. I agree that the UI is not at all intuitive,[1] but like most things in git, submodules are quite simple under the hood. In this post I’ll share the incantations for solving some specific submodule problems, and try to shed some light on what’s really going on.

Some time ago, someone here wrote an incredibly useful and not at all contrived python library for multiplying integers. I decided to write a convenient CLI for this library. First I started a new repo:

$ mkdir multiplier-cli && cd multiplier-cli

$ git init
Initialized empty Git repository in .../.git/

Then I added the library as a submodule, because it’s not on PyPI:

$ git submodule add git@github.com:lshift/multiplier
Cloning into 'multiplier'...
remote: Counting objects: 5, done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 5 (delta 0), reused 5 (delta 0)
Receiving objects: 100% (5/5), done.
Checking connectivity... done.

Then I wrote my CLI, multiply, and a .gitignore, and committed everything:

$ git add -A && git commit -m 'initial commit'
[master (root-commit) 8edaf0a] initial commit
 4 files changed, 19 insertions(+)
 create mode 100644 .gitignore
 create mode 100644 .gitmodules
 create mode 160000 multiplier
 create mode 100755 multiply

The problem

Before long, one of the many users of multiplier-cli reported that it was pretty slow and memory-hungry for large numbers:

$ time ./multiply 100 100000000
10000000000
./multiply 100 100000000  10.16s user 8.34s system 74% cpu 24.771 total

I was able to optimize the library, but the author of the original multiplier was adamant that 9-digit numbers were out of the library’s scope and wasn’t willing to sacrifice readability for higher performance. So I decided to maintain my own high-performance fork with a suitably snappy new name. However, this left me in the awkward position of having to convince git to use the new repository.

The theory

Let’s pause for a second and see what actually happens when you run:

git clone --recursive https://github.com/ash-lshift/multiplier-cli

This is approximately equivalent to:

git clone https://github.com/ash-lshift/multiplier-cli
cd multiplier-cli && git submodule update --init --recursive

…which is in turn approximately equivalent to:

git clone https://github.com/ash-lshift/multiplier-cli
cd multiplier-cli && git submodule init && git submodule update --recursive

According to the docs, git submodule init copies “submodule names and urls from .gitmodules to .git/config”. They mean exactly what they say:

$ cat .gitmodules
[submodule "multiplier"]
        path = multiplier
        url = git@github.com:lshift/multiplier

$ cat .git/config
...(snip)...
[submodule "multiplier"]
        url = git@github.com:lshift/multiplier
...(snip)...
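If you want to watch git submodule init do this copy without touching the network, here is a minimal sketch using throwaway local repositories. The names lib, app and clone, the g wrapper and the demo identity are all made up for the example. The protocol.file.allow=always setting is only there because recent git versions refuse to clone submodules from local paths by default.

```shell
# Throwaway playground; every path and name below is hypothetical.
tmp=$(mktemp -d)
g() { git -c protocol.file.allow=always -c init.defaultBranch=master \
          -c user.name=demo -c user.email=demo@example.com "$@"; }

# A tiny "library" and a superproject that uses it as a submodule.
g init -q "$tmp/lib"
g -C "$tmp/lib" commit -q --allow-empty -m 'lib: initial commit'
g init -q "$tmp/app"
g -C "$tmp/app" submodule --quiet add "$tmp/lib" lib
g -C "$tmp/app" commit -q -m 'add lib as a submodule'

# A plain (non-recursive) clone has .gitmodules but no submodule URL in .git/config yet.
g clone -q "$tmp/app" "$tmp/clone"
cd "$tmp/clone"
before=$(git config --get submodule.lib.url || true)

# init copies the URL across; nothing is cloned or checked out at this point.
g submodule --quiet init
after=$(git config --get submodule.lib.url)
declared=$(git config --file .gitmodules --get submodule.lib.url)

echo "before init: '$before'"
echo "after init:  '$after'"
echo ".gitmodules: '$declared'"
```

The lib directory in the clone stays empty until git submodule update runs.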

As for git submodule update, the docs say it will “clone missing submodules and checkout the commit specified in the index of the containing repository”. Cloning and checking out aren’t too cryptic, but where’s that commit specified exactly?

“Index” refers to the contents of the .git/index file. (The word “index” is unfortunately used pretty much interchangeably with “cache” and “stage”.) You can see the contents of this file in a human-readable form like so:

$ git ls-files --stage  # for stage, read index
100644 0d20b6487c61e7d1bde93acf4a14b7a89083a16d 0   .gitignore
100644 180d2644e702a3fbd03e656170675a1e19e3248e 0   .gitmodules
100644 2bf839e3c692a131cd18e291cfdb09cc16669acb 0   README.md
160000 37690b767a226b1baea7984f531e7853a23e401f 0   multiplier
100755 18e17afd796372e1651213d9460a220404ddbc2c 0   multiply
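To pick the submodule entries out of the index programmatically, filtering on mode 160000 should be enough. The following is a rough sketch against a scratch repository. The names lib and app and the g wrapper are hypothetical, and the awk one-liner assumes paths without whitespace.

```shell
# Scratch setup; names and paths are made up for illustration.
tmp=$(mktemp -d)
g() { git -c protocol.file.allow=always -c init.defaultBranch=master \
          -c user.name=demo -c user.email=demo@example.com "$@"; }
g init -q "$tmp/lib"
g -C "$tmp/lib" commit -q --allow-empty -m 'lib: initial commit'
g init -q "$tmp/app"
g -C "$tmp/app" submodule --quiet add "$tmp/lib" lib
g -C "$tmp/app" commit -q -m 'add lib as a submodule'
cd "$tmp/app"

# Each index line reads "<mode> <object id> <stage><TAB><path>".
entry=$(git ls-files --stage -- lib)
mode=$(printf '%s\n' "$entry" | cut -d' ' -f1)
recorded=$(printf '%s\n' "$entry" | cut -d' ' -f2)

# Every path whose mode is 160000 is a submodule (a "gitlink").
gitlinks=$(git ls-files --stage | awk '$1 == "160000" { print $4 }')

echo "$entry"
echo "gitlink paths: $gitlinks"
```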

Let’s back up a bit: for almost all of git’s operations, it’s sufficient to understand the contents of .git/objects and .git/refs. (If you’ve never looked in the .git directory, now may be a good time to read the official but fairly gentle introduction to the internals of git.)

The commit we’re currently on is called HEAD. We can get the ID of the object that represents this commit like so:

$ git rev-parse HEAD
30071f25cc3bac610b78b91e7eb45aef8a3f15a1

This object is stored in the file .git/objects/30/071f25cc3bac610b78b91e7eb45aef8a3f15a1, but it’s not human-readable. We can pretty-print it:

$ git cat-file -p $(git rev-parse HEAD)
tree 0917333474bfb6d27330a2b2320efb37dc738646
author ash-lshift <ash@lshift.net> 1422719429 +0000
committer ash-lshift <ash@lshift.net> 1422727190 +0000

initial commit

The tree line tells us which object describes the state of the repository after this commit. It looks like this:[2]

$ git cat-file -p 0917333474bfb6d27330a2b2320efb37dc738646
100644 blob 0d20b6487c61e7d1bde93acf4a14b7a89083a16d    .gitignore
100644 blob 180d2644e702a3fbd03e656170675a1e19e3248e    .gitmodules
100644 blob 2bf839e3c692a131cd18e291cfdb09cc16669acb    README.md
160000 commit 37690b767a226b1baea7984f531e7853a23e401f  multiplier
100755 blob 18e17afd796372e1651213d9460a220404ddbc2c    multiply
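The relationship between a commit ID, its file under .git/objects and its tree can be checked in a few lines. This is only a sketch on a scratch repository (the repo directory and README.md are made up), and it assumes the commit is still a loose object, which is the case right after committing but not necessarily after git has packed the repository.

```shell
# Scratch repository with a single commit; names are hypothetical.
tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"
echo 'hello' > README.md
git add README.md
git -c user.name=demo -c user.email=demo@example.com commit -q -m 'initial commit'

commit=$(git rev-parse HEAD)

# Loose objects: the first two hex digits name the directory, the rest name the file.
object_path=".git/objects/$(printf '%s' "$commit" | cut -c1-2)/$(printf '%s' "$commit" | cut -c3-)"
ls -l "$object_path"

# The tree ID can be read off the commit object, or resolved with the ^{tree} suffix.
tree_from_commit=$(git cat-file -p "$commit" | sed -n 's/^tree //p')
tree_from_revparse=$(git rev-parse 'HEAD^{tree}')
tree_type=$(git cat-file -t "$tree_from_commit")

echo "tree (from commit object): $tree_from_commit"
echo "tree (from rev-parse):     $tree_from_revparse"
echo "object type:               $tree_type"
```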

Git tree objects are usually a list of references to other trees and blobs, but here there’s a commit mixed in. Can we take a look at that?

$ git cat-file -p 37690b767a226b1baea7984f531e7853a23e401f
fatal: git cat-file 37690b767a226b1baea7984f531e7853a23e401f: bad file

Oh. Git doesn’t know about that commit. So why isn’t everything completely broken? Well, the short answer is that the entry has the special mode 160000 which causes git to try to find the object in submodules. (For the morbidly curious: try grepping for S_ISGITLINK in the git source code.) Sure enough:

$ git --git-dir=multiplier/.git cat-file -p 37690b767a226b1baea7984f531e7853a23e401f
tree 79eebfe5deeba6fb1b8ebef28ace995b2a0e82e9
author A. N. Other <ash+another@lshift.net> 1112908393 +0100
committer A. N. Other <ash+another@lshift.net> 1112908393 +0100

initial commit

So that’s how git submodule update decides which commit to check out.
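Put together, a sketch like the following compares the commit the superproject has recorded with the one the submodule actually has checked out, and confirms that the commit object itself lives only in the submodule. The lib and app names and the g wrapper are assumptions made for this scratch setup.

```shell
# Scratch setup; names and paths are made up for illustration.
tmp=$(mktemp -d)
g() { git -c protocol.file.allow=always -c init.defaultBranch=master \
          -c user.name=demo -c user.email=demo@example.com "$@"; }
g init -q "$tmp/lib"
g -C "$tmp/lib" commit -q --allow-empty -m 'lib: initial commit'
g init -q "$tmp/app"
g -C "$tmp/app" submodule --quiet add "$tmp/lib" lib
g -C "$tmp/app" commit -q -m 'add lib as a submodule'
cd "$tmp/app"

recorded=$(git rev-parse HEAD:lib)        # ID stored in the superproject's tree
checked_out=$(git -C lib rev-parse HEAD)  # commit checked out inside the submodule

# The superproject stores only the ID; the object is in the submodule's database.
if git cat-file -e "$recorded" 2>/dev/null; then in_super=yes; else in_super=no; fi
type_in_sub=$(git -C lib cat-file -t "$recorded")

echo "recorded:    $recorded"
echo "checked out: $checked_out"
echo "present in superproject objects: $in_super"
echo "type inside submodule:           $type_in_sub"
```

When recorded and checked_out differ, git status in the superproject reports the submodule as modified.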

The practice

So, in order to convince git to use the new library, we “just” have to change .gitmodules, .git/config and the reference to the commit. The first two are easy enough. (You don’t even need to update .git/config, because it’s local to your repository and no one else will see it. That, and git submodule sync will update it for you.) Here we go:

$ git config --file .gitmodules submodule.multiplier.url git@github.com:ash-lshift/multimaster-6000  # update .gitmodules
$ git submodule sync  # update .git/config
$ git add .gitmodules
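The same three commands can be tried out on local scratch repositories. In this sketch the lib, fork and app names and the g wrapper are hypothetical. It also checks something that is easy to miss: git submodule sync rewrites the origin URL inside the submodule's own repository as well as the entry in .git/config.

```shell
# Scratch setup; names and paths are made up for illustration.
tmp=$(mktemp -d)
g() { git -c protocol.file.allow=always -c init.defaultBranch=master \
          -c user.name=demo -c user.email=demo@example.com "$@"; }
g init -q "$tmp/lib"
g -C "$tmp/lib" commit -q --allow-empty -m 'lib: initial commit'
g clone -q "$tmp/lib" "$tmp/fork"
g init -q "$tmp/app"
g -C "$tmp/app" submodule --quiet add "$tmp/lib" lib
g -C "$tmp/app" commit -q -m 'add lib as a submodule'
cd "$tmp/app"

g config --file .gitmodules submodule.lib.url "$tmp/fork"  # edit the tracked file
g submodule --quiet sync                                  # propagate to local config
g add .gitmodules

super_url=$(git config --get submodule.lib.url)
sub_remote=$(git -C lib config --get remote.origin.url)
staged=$(git diff --cached --name-only)

echo "superproject config: $super_url"
echo "submodule origin:    $sub_remote"
echo "staged:              $staged"
```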

As for changing the reference, first we need to work out what we want to change it to:

$ git ls-remote git@github.com:ash-lshift/multimaster-6000 master
c0c5e1474a9306eb8cb92b9e9ad9f45a2cc9c01f    refs/heads/master

Now here’s the most direct way possible of updating the reference:

$ git update-index --cacheinfo 160000,c0c5e1474a9306eb8cb92b9e9ad9f45a2cc9c01f,multiplier
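Here is a sketch of the same ls-remote plus update-index step on scratch repositories. The lib, fork and app names and the g wrapper are assumptions for the example. It shows that only the index entry moves, while the submodule's working tree stays on the old commit.

```shell
# Scratch setup; names and paths are made up for illustration.
tmp=$(mktemp -d)
g() { git -c protocol.file.allow=always -c init.defaultBranch=master \
          -c user.name=demo -c user.email=demo@example.com "$@"; }
g init -q "$tmp/lib"
g -C "$tmp/lib" commit -q --allow-empty -m 'lib: initial commit'
g clone -q "$tmp/lib" "$tmp/fork"
g -C "$tmp/fork" commit -q --allow-empty -m 'optimize multiply'
g init -q "$tmp/app"
g -C "$tmp/app" submodule --quiet add "$tmp/lib" lib
g -C "$tmp/app" commit -q -m 'add lib as a submodule'
cd "$tmp/app"

# Ask the fork where its master branch points, without fetching anything.
new=$(g ls-remote "$tmp/fork" refs/heads/master | cut -f1)

# Rewrite the gitlink entry in the index directly.
g update-index --cacheinfo "160000,$new,lib"

staged=$(git ls-files --stage -- lib | cut -d' ' -f2)
worktree=$(git -C lib rev-parse HEAD)

echo "fork master:        $new"
echo "staged gitlink:     $staged"
echo "submodule worktree: $worktree"
```

Because the working tree was never touched, git status will show lib as modified until the submodule itself is moved to the new commit.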

We’re almost done. Here’s the grand total of our changes:

$ git diff --cached  # for cached, read index
diff --git a/.gitmodules b/.gitmodules
index 180d264..9214d60 100644
--- a/.gitmodules
+++ b/.gitmodules
@@ -1,3 +1,3 @@
 [submodule "multiplier"]
        path = multiplier
-       url = git@github.com:lshift/multiplier
+       url = git@github.com:ash-lshift/multimaster-6000
diff --git a/multiplier b/multiplier
index 37690b7..c0c5e14 160000
--- a/multiplier
+++ b/multiplier
@@ -1 +1 @@
-Subproject commit 37690b767a226b1baea7984f531e7853a23e401f
+Subproject commit c0c5e1474a9306eb8cb92b9e9ad9f45a2cc9c01f

Now we can commit and push, and all new clones of the repository will be nice and up-to-date. In fact, you can see the results here: the master branch is using the original library and the faster branch is using the new one.

However, existing clones will need to be updated. Here you come up against the UI again.

Updating existing clones

Suppose a user makes a new clone:

$ git clone --recursive gh:ash-lshift/multiplier-cli && cd multiplier-cli
Cloning into 'multiplier-cli'...
...(snip)...
Submodule 'multiplier' (git@github.com:lshift/multiplier) registered for path 'multiplier'
Cloning into 'multiplier'...
...(snip)...
Submodule path 'multiplier': checked out '37690b767a226b1baea7984f531e7853a23e401f'

They’re currently on master because that’s the default branch, but suppose they want to try out the faster branch:

$ git checkout faster
M   multiplier
Branch faster set up to track remote branch faster from origin.
Switched to a new branch 'faster'

What’s that line about multiplier?

$ git status
On branch faster
Your branch is up-to-date with 'origin/faster'.
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

    modified:   multiplier (new commits)

no changes added to commit (use "git add" and/or "git commit -a")

Oh. Let’s try what might seem like the obvious solution:

$ git submodule update
fatal: reference is not a tree: c0c5e1474a9306eb8cb92b9e9ad9f45a2cc9c01f
Unable to checkout 'c0c5e1474a9306eb8cb92b9e9ad9f45a2cc9c01f' in submodule path 'multiplier'

Why? Because when we first cloned, our .git/config was updated, but when we checked out faster, it wasn’t:

$ git config submodule.multiplier.url
git@github.com:lshift/multiplier

Let’s try again:

$ git submodule sync && git submodule update
Synchronizing submodule url for 'multiplier'
remote: Counting objects: 3, done.
remote: Compressing objects: 100% (1/1), done.
remote: Total 3 (delta 1), reused 3 (delta 1)
Unpacking objects: 100% (3/3), done.
From github.com:ash-lshift/multimaster-6000
   37690b7..c0c5e14  master     -> origin/master
Submodule path 'multiplier': checked out 'c0c5e1474a9306eb8cb92b9e9ad9f45a2cc9c01f'

Job done.
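The whole failure-then-fix sequence can be reproduced offline. The sketch below rebuilds the situation with local scratch repositories. The names lib, fork, app and clone and the g wrapper are hypothetical, and the branch names simply mirror the ones used in this post. It only records whether the first update fails, since the exact error text varies between git versions.

```shell
# Scratch setup; names and paths are made up for illustration.
tmp=$(mktemp -d)
g() { git -c protocol.file.allow=always -c init.defaultBranch=master \
          -c user.name=demo -c user.email=demo@example.com "$@"; }
g init -q "$tmp/lib"
g -C "$tmp/lib" commit -q --allow-empty -m 'lib: initial commit'
g clone -q "$tmp/lib" "$tmp/fork"
g -C "$tmp/fork" commit -q --allow-empty -m 'optimize multiply'
new=$(git -C "$tmp/fork" rev-parse HEAD)

# master uses the original library; faster switches URL and commit to the fork.
g init -q "$tmp/app"
cd "$tmp/app"
g submodule --quiet add "$tmp/lib" lib
g commit -q -m 'use the original library'
g checkout -q -b faster
g config --file .gitmodules submodule.lib.url "$tmp/fork"
g add .gitmodules
g update-index --cacheinfo "160000,$new,lib"
g commit -q -m 'switch to the fork'

# A user clones master, then tries the faster branch.
g clone -q --recursive --branch master "$tmp/app" "$tmp/clone"
cd "$tmp/clone"
g checkout -q faster

# The submodule still fetches from the old URL, so the new commit cannot be found.
if g submodule --quiet update 2>/dev/null; then first=ok; else first=failed; fi

# sync picks up the URL from the checked-out .gitmodules; update then succeeds.
g submodule --quiet sync && g submodule --quiet update
now=$(git -C lib rev-parse HEAD)

echo "first update: $first"
echo "submodule is now at: $now"
```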

The normal way of doing it / TL;DR

Normally, you wouldn’t use git update-index — it’s considered low-level, and the docs claim that git add is “a more user-friendly” way to achieve many of the same results.

To change the submodule to the new multimaster-6000 fork, we could equally have done the following:

$ git config --file .gitmodules submodule.multiplier.url git@github.com:ash-lshift/multimaster-6000

$ git submodule sync
Synchronizing submodule url for 'multiplier'

$ git add .gitmodules

$ git submodule foreach '(git fetch && git reset --hard origin/master)'
Entering 'multiplier'
remote: Counting objects: 3, done.
remote: Compressing objects: 100% (1/1), done.
remote: Total 3 (delta 1), reused 3 (delta 1)
Unpacking objects: 100% (3/3), done.
From github.com:ash-lshift/multimaster-6000
   37690b7..c0c5e14  master     -> origin/master
HEAD is now at c0c5e14 optimize multiply

$ git add multiplier

$ git diff --cached
diff --git a/.gitmodules b/.gitmodules
index 180d264..9214d60 100644
--- a/.gitmodules
+++ b/.gitmodules
@@ -1,3 +1,3 @@
 [submodule "multiplier"]
        path = multiplier
-       url = git@github.com:lshift/multiplier
+       url = git@github.com:ash-lshift/multimaster-6000
diff --git a/multiplier b/multiplier
index 37690b7..c0c5e14 160000
--- a/multiplier
+++ b/multiplier
@@ -1 +1 @@
-Subproject commit 37690b767a226b1baea7984f531e7853a23e401f
+Subproject commit c0c5e1474a9306eb8cb92b9e9ad9f45a2cc9c01f
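For comparison, here is the same porcelain-only sequence as a sketch on scratch repositories. The lib, fork and app names and the g wrapper are assumptions for the example. The point to notice is that git add on a submodule path stages whatever commit the submodule currently has checked out.

```shell
# Scratch setup; names and paths are made up for illustration.
tmp=$(mktemp -d)
g() { git -c protocol.file.allow=always -c init.defaultBranch=master \
          -c user.name=demo -c user.email=demo@example.com "$@"; }
g init -q "$tmp/lib"
g -C "$tmp/lib" commit -q --allow-empty -m 'lib: initial commit'
g clone -q "$tmp/lib" "$tmp/fork"
g -C "$tmp/fork" commit -q --allow-empty -m 'optimize multiply'
g init -q "$tmp/app"
g -C "$tmp/app" submodule --quiet add "$tmp/lib" lib
g -C "$tmp/app" commit -q -m 'add lib as a submodule'
cd "$tmp/app"

g config --file .gitmodules submodule.lib.url "$tmp/fork"
g submodule --quiet sync
g add .gitmodules

# Move the submodule's working tree to the fork's master, then stage it.
g submodule --quiet foreach 'git fetch -q && git reset -q --hard origin/master'
g add lib

staged=$(git ls-files --stage -- lib | cut -d' ' -f2)
fork_head=$(git -C "$tmp/fork" rev-parse HEAD)

echo "staged gitlink: $staged"
echo "fork master:    $fork_head"
```

Unlike the update-index route, this leaves the submodule's working tree and the index in agreement, so git status is clean after committing.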

Bonus incantation

You can blow away all the local/transient submodule state if you get stuck:

$ git submodule deinit -f .  # destructive!
Cleared directory 'multiplier'
Submodule 'multiplier' (git@github.com:ash-lshift/multimaster-6000) unregistered for path 'multiplier'
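To see what deinit actually removes, and that it can be undone, here is a sketch on a scratch repository. The lib and app names and the g wrapper are hypothetical. Only local state goes away: .gitmodules and the recorded commit are tracked, so they are untouched.

```shell
# Scratch setup; names and paths are made up for illustration.
tmp=$(mktemp -d)
g() { git -c protocol.file.allow=always -c init.defaultBranch=master \
          -c user.name=demo -c user.email=demo@example.com "$@"; }
g init -q "$tmp/lib"
g -C "$tmp/lib" commit -q --allow-empty -m 'lib: initial commit'
g init -q "$tmp/app"
g -C "$tmp/app" submodule --quiet add "$tmp/lib" lib
g -C "$tmp/app" commit -q -m 'add lib as a submodule'
cd "$tmp/app"

# Destructive: empties lib/ and drops its section from .git/config.
g submodule --quiet deinit -f .

url_after=$(git config --get submodule.lib.url || true)
files_after=$(ls -A lib)
recorded=$(git rev-parse HEAD:lib)

# The tracked state is still there, so the submodule can be brought back.
g submodule --quiet update --init
restored=$(git -C lib rev-parse HEAD)

echo "URL in .git/config after deinit: '$url_after'"
echo "files in lib/ after deinit:      '$files_after'"
echo "restored to: $restored"
```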

Footnotes

  1. I would venture that writing the submodule command as a shell script with embedded perl negatively impacts the likelihood that someone will try to improve its UI.
  2. You can get the tree object corresponding to the HEAD commit object directly:
    $ git rev-parse 'HEAD^{tree}'
    3a8ea818dda8c3435d314a4f7550469475dd9e45

    For more little-known tricks along these lines see the ‘specifying revisions’ section of the docs.

Comment

  1. Douglas says:

    I don’t really like git submodules vendoring approaches, because they assume all of your dependencies are going to be version controlled via git. It seems like a bad thing to couple a build toolchain to a particular vcs.
