Repository crawlers for Mercurial (or why you need to learn about revsets)

By: on January 30, 2011

Recently I needed to write a tool to crawl a Mercurial repository and look for certain things in unfinished branches that could cause us problems in the future. Given I knew that Mercurial was written in Python, my first approach to this was to start digging around in its code and see if there was anything in there I could cannibalise to build what I needed. This appeared to be bearing fruit, as I found the ancestor module pretty quickly, but I rapidly realised that in order to do something relatively simple I was going to have to copy vast reams of the Mercurial code to support it.

This is the wrong approach, or at least for most purposes it’s a really bad idea, as there’s a much easier way to write this sort of tool. The primary goal of a crawler tool is to get some well-defined subset of the revision tree and then either confirm or alter some properties of those revisions. Most of the time and effort I was expending went into trying to find the wanted revisions, but Mercurial 1.6 added a lovely feature called “Revision Sets”, or revsets for short (which thankfully one of my colleagues pointed me towards before I got too deep into the brute-force approach). Running hg help revsets will tell you the full syntax, but the first words of that (“Mercurial supports a functional language for selecting a set of revisions”) tell you all you really need to know.

To use them in your code, you need a little bit of boilerplate first.
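
A minimal sketch of what such boilerplate might look like. This version shells out to the hg command line rather than importing Mercurial’s Python modules, and it assumes hg is on your PATH. build_command is a hypothetical helper name introduced only for this sketch.

```python
import subprocess

# Path to the repository to crawl; "." means the current directory.
repo = "."


def build_command(revset, repo_path):
    """Build the hg invocation that prints the full hash of each matching revision."""
    return ["hg", "log", "-R", repo_path, "-r", revset, "--template", "{node}\n"]


def revs(revset, repo_path=repo):
    """Return the changeset hashes selected by `revset` as a list.

    A list rather than a generator, so the result can be sliced.
    Requires the hg executable; raises CalledProcessError if hg fails.
    """
    output = subprocess.check_output(build_command(revset, repo_path))
    return output.decode("ascii").split()
```

With this in place, something like revs("head() and not closed()")[:5] would return the first five open branch heads that hg reports.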

The revs routine just makes it easier to use revsets (it could also be done as a generator, but I found I needed to do list slices of things too often for that to be useful), and you could make repo point to another specified location if you want, but this’ll work with the repository in the current directory.

OK, now here are some fun things I just used in my script.
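
As an illustration only (these are not the queries from the original script), here are a few revsets that fit the unfinished-branches problem, built from predicates listed in hg help revsets. The names REVSETS and unmerged_into are hypothetical.

```python
# Revset expressions are plain strings in Mercurial's query language.
REVSETS = {
    # Heads of named branches that have not been closed.
    "open_heads": "head() and not closed()",
    # The same, but ignoring the default branch.
    "open_non_default_heads": "head() and not closed() and not branch('default')",
    # Merge changesets anywhere in the repository.
    "merges": "merge()",
}


def unmerged_into(target):
    """Revset for changesets reachable from an open head but not from `target`."""
    return "ancestors(%s) - ancestors('%s')" % (REVSETS["open_heads"], target)
```

Each of these strings can be passed to hg log -r on the command line, or to whatever helper your script uses to evaluate revsets.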

(shoving things in as strings isn’t particularly Pythonic, but such is the price of using an embedded DSL)

All of this can also be used at the command line, but in a lot of cases you then want to do some more filtering (e.g. I wanted to find certain patches whose format would be hard to specify with a regular expression, but was simple with a line or two of Python), or print out various bits of information about said revisions. The net result was that the tool I needed got written in a couple of hours, versus potentially days’ worth of work if I’d continued down the original route.
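
A sketch of that kind of post-filtering, assuming the revisions were printed with a tab-separated template such as hg log -r REVSET --template "{node|short}\t{desc|firstline}\n". The predicate and the sample output below are made up for illustration and stand in for whatever project-specific check you need.

```python
def parse_log(output):
    """Split tab-separated `hg log` template output into (node, summary) pairs."""
    pairs = []
    for line in output.splitlines():
        if not line:
            continue
        node, _, summary = line.partition("\t")
        pairs.append((node, summary))
    return pairs


def touches_migrations(summary):
    """Hypothetical check that is easier to express in Python than as one regex."""
    words = summary.lower().split()
    return "migration" in words and not summary.startswith("Merge")


# Made-up sample output, in the shape the template above would produce.
sample = "1a2b3c4d5e6f\tAdd migration for user table\n0f9e8d7c6b5a\tMerge default\n"
wanted = [node for node, summary in parse_log(sample) if touches_migrations(summary)]
```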

All of this is written with an unmentioned proviso, in that the first section of the Mercurial API page is called “Why you shouldn’t use Mercurial’s internal API”, and that’s exactly what we’re doing here. However, by using revsets as opposed to any other part of the API, we’ve got an advantage in that the Mercurial developers themselves advise using “Mercurial’s published, documented, and stable API: the command line interface”, and revsets are explicitly part of the command line interface. This doesn’t mean that they, or the interface we’re using to get to them, are guaranteed not to change in the next release, but the odds are better than for most parts of the code.



  1. Johnny says:

    “certain things in unfinished branches that could cause us problems in the future.”

    I’m curious what those might be.

  2. Tom Parker says:

    @Johnny: Database migrations. Say you have a set of migrations, call them 1, 2 and 3 (in real life they’re datestamps instead of just plain numbers), where your default branch has 1 and 3 and a bug branch has 2. If you deploy default, the migration script runs and remembers it’s dealt with everything up to 3. If the bug branch is later merged into default and redeployed, 2 is now in the list of patches but doesn’t get run, as 2 < 3 and 3 has already been run.

    Some systems solve this by just running 2 as they’ve got a record of which ones have been applied, but that doesn’t work if 2 and 3 are interdependent, and AFAIK no current database migration system keeps track of that information.

  3. Johnny says:

    Would love to see your solution opensourced.

  4. Tom Parker says:

    @Johnny: We haven’t got a solution for the problem, just a chunk of code that’ll tell you there’s a potential problem, and it’s a) rather specific to the source tree of a particular project and b) overly pessimistic about there being a problem.

    I’ve got an actual solution in mind, that would dig through the migrations and do this properly, but that will take a lot longer to write. If I eventually have the time to do so, that’ll get open-sourced.
