A memory gotcha

By: on August 30, 2016

A couple of weeks ago I was reading Juho Snellman’s blog on implementing a hierarchical timer wheel, and as usual, over on the morning paper, Adrian’s covered a paper on various approaches to timer structures. What I found most interesting though is the final graph on Juho’s blog post where he does some performance testing with large numbers of events and finds:

Despite all the important operations being constant time in theory, performance degrades non-linearly and really hits a wall around the 256k work units, where a doubling the workload increases runtime by a factor of 5. It’s almost certainly a cache issue. The i7-2640M I ran the tests on has a 4MB cache, so for the 128k work unit case it’s already a struggle to keep even the events on the core wheel in L3. So the suboptimal scaling is probably to be expected.

With that sort of in my mind, a colleague recently pointed me at Emil Ernerfeldt’s blog and his four part series from 2014 on how we really shouldn’t assume memory access is ever O(1), and that whilst we can use big-O notation to analyse the relationship between input size and instruction count, we can also usefully use it to analyse the relationship between input size and memory access time. In terms of understanding the latency characteristics of our programs, this may well be more important.

This also reminded me of the beware of large constant overheads mantra. For example, sure, you can use a much more complicated algorithm, and you can get the complexity down to O(log_2(N)) for some particular problem. But if the constant factor that big-O notation throws away is large then you’ll likely find that your complex “optimal” algorithm is in fact much slower for small problem sizes. This can be due to either having to always do millions more instructions, or be due to a lot of pointer chasing and cache misses. So if you happen to know that you’re only ever going to see smallish input sizes, then a simpler and less sophisticated algorithm may well be faster.

An example of this could be a hash-map. Let’s assume the hash-map uses a modern hash function, for example SipHash. The result of hashing a key is going to be a 64-bit value, so we have 2^64 possible hash codes. Assuming that the hash function we’re using achieves uniform distribution of keys to hash codes, we can then use the Birthday Paradox to find that we’re going to need around 5.1*10^9 different keys before we have a 50% probability of two keys hashing to the same value. 5.1*10^9 is greater than 2^32 (and less than 2^33). So if we’re on a 64-bit architecture, with 64-bit pointers serving as the keys to the hash-map, then we’re going to take up 8*2^32 bytes (or 32GB) of memory just in keys before we have a 50% probability of a hash collision. So if we now want to decide what to do when we do have a hash collision, the answer is almost certainly not to implement something complicated like a B-tree in the hash buckets because that added complexity is never going to be worth it. A simple array of values is probably the right answer even if it involves copying memory: the probability of a collision is so small that it will almost always be the case that there is no collision, so avoiding any large constant overhead in the hash bucket is far more important.

Anyway. At some point last year, I wrote a skiplist implementation in Go. Now for various reasons, Go is my default language these days, and I also quite like skiplists: they’re easier to implement than B-trees as you don’t have to worry about re-balancing, but they still have O(log_2(N)) complexity on the important functions and doing an ordered traversal through the collection is really simple and cheap (which is what I really needed at the time). In Go, because the language allows you to distinguish between structs as values and pointers to structs, a lot of optimisation comes down to batching up memory allocations: rather than just allocating a new struct every time you need one, instead allocate a sensible number of the structs as an array of values up front, and then you can peel them off as you need them. This achieves two things: first you call malloc less frequently, and secondly, because the struct values are allocated as an array, they’re in a contiguous block of memory. That means they’re more likely to be in the same cache line so reducing the cost of pointer-chasing, and it might also make life easier for the garbage collector. This strategy is not possible in all languages. So for example, in my skiplist, a node in the list is modelled as:

type Node struct {
    Key        Comparable
    Value      interface{}
    heightRand float32
    prev       *Node
    nexts      []*Node
    skiplist   *SkipList

Now whenever I need a new Node (for example a new key-value pair is being added to the skiplist), I could just malloc the memory and set up the struct with something like node := &Node{ ... }. But this could lead to a lot of memory fragmentation and expensive pointer chasing. So instead I do some batch allocation: the SkipList struct itself looks like:

type SkipList struct {
    length             uint
    terminus           *Node
    levelProbabilities []float32
    curCapacity        uint
    curDepth           uint
    nodes              []Node
    ptrs               []*Node
    localRand          *rand.Rand

of which the important entry is nodes. Notice that this is a list of Node values, not a list of pointers-to-Nodes, so we’ve fully allocated the memory for those nodes as a contiguous block of memory. Now when we need a new node, we can write a little helper method:

func (s *SkipList) getNode() *Node {
    if len(s.nodes) == 0 {
        s.nodes = make([]Node, int(s.curCapacity))
    n := &s.nodes[0]
    s.nodes = s.nodes[1:]
    return n

So if our list (slice in Go terminology) of nodes is empty then we allocate a new block of Node values with the size of the block equal to the current capacity of the skiplist (capacity increases exponentially, basically the same as with a tree). What we return is the pointer to the first element within the list, and we advance the start of the list to the next element, essentially treating the list as a queue and performing a shift-like operation. This means that even though we might be using these nodes in completely different parts of the skiplist, fetching one Node from memory into a cache-line will bring in others too at the same time, and so there is a chance that pointer-chasing through the skiplist will require fewer loads from memory than if we just malloc’d up each Node as we needed it. Benchmarks demonstrated this to be the case.

So finally, we get to the gotcha. The above code once didn’t look like that. A user reported surprising amounts of memory being used and even messing around with manually provoking Go’s GC to run, and asking it to return memory to the OS didn’t free up enough RAM. Now sadly Go is currently pretty hopeless at debugging this sort of issue (if I’ve overlooked some tool that does what I need, please comment below!): you can do memory profiling, but all that tells you is which methods tend to allocate memory. What I want instead is a dump of the heap and the ability to draw a graph of all the pointers from the GC-roots through the heap, with objects correctly labelled by type. Even just drawing the strongly-connected components of that graph would be super useful.

In the above code, what I’m doing is peeling off from the front of the list of nodes. In Go, a slice is really a struct containing three things: a pointer to the first element of the underlying array; an int of the length of the array; and an int of the capacity of the array. Previously, rather than doing a shift-like operation and peeling off from the front of the slice, I’d been doing a pop-like operation and peeling off from the end of the slice. Logically, it’s the same, or seems like it should be. But the pop version will only reduce the length of slice, not the capacity (i.e. you could subsequently do an append to the slice, and the append implementation would know that there’s spare capacity in the array so it can use that capacity rather than allocating a whole new bigger array and copying everything over to it). What that means is that the used bits of the slice are still reachable from the nodes field within the SkipList object. Therefore, they can’t be garbage collected, even if they’re no longer being used as nodes within the skiplist itself, hence the leak. In contrast, once you advance the start of the slice (i.e. the pointer to first element in the underlying array), there’s no going back, so if that Node is used and then removed from the skiplist, it really can be reclaimed.

So quite a long journey to get to an annoying and tricky-to-track-down bug. Accessing memory isn’t O(1) in time, and planning memory allocations to play nicely with caches can be very important to optimise performance – so much so that it’s worth writing additional code and spending extra CPU cycles to make that happen. But as ever, the more code you write, the more bugs you have.



  1. uwe_wolter@t-online.de says:

    Hi, do you mind if I reference your “content” a.s.a.p.!??

    Bernd Wolter, Waldstr.2a, D-18442 Kummerow

Leave a Reply

Your email address will not be published.

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>