## Looking for needles in haystacks

By: on October 26, 2005

Here’s a proper programming challenge, simple to state, and in the end quite challenging to solve. A function produces a 64-bit output given a 64-bit input. It mostly behaves like a random function, so given a random input, most outputs occur with probability between 0 and 2-63; in other words, for a randomly-chosen 64-bit y, there are usually between 0 and 2 solutions to the equation f(x) = y. However, a few occur with much greater frequency – more like 2-32 or so. The function is pretty quick and easy to calculate, but its internal structure defies direct analysis. How would you identify these more frequent occurrences by experiment? How big a computer would you need to do it in something like a few hours to a few days? Think about it for a while before you read on.

First of all, to be at all confident of finding these high frequency outputs, you’re going to have to examine on the order of 234 outputs. That’s 128 GB of data. When you produce your last output, you want to be able to determine, at some stage, whether it’s the same as any output you’ve already produced, and I think that this makes storing all previous outputs unavoidable, so you’re going to need a big disk. Fortunately I have a 200 GB disk attached to my home machine that I’m not using yet, so I can manage this part. The tricky thing is, how are you going to arrange your data on disk so you can find duplicates?

You can’t just use the disk as one big hash table, because each write will call for a seek, and 8.5 ms times 234 is about 4.6 years. You could buffer and sort your writes, but my home machine only has 512 MB of memory and it needs some for other things, so you would have to do about 512 buffered writes and crawl through the entire 128GB data structure every time, which would be desperately slow each time. You could just write it all to disk and then sort it, but I suspect you’d be looking at a long wait for a disk-based sort of 16 billion 8-byte items to finish – just reading and re-writing the entire dataset will take over three hours, and you would probably have to do that at least nine times.

So the first idea I had was this: As you generate each datum, hash it and put it into one of 1024 bins based on the hash code, and write all the bins at once. Each bin now only contains 128 MB of data, which is sufficiently little that we can read it all into memory and do duplicate finding that way using a big hash table. I implemented this by writing to 1024 files at once, one file per bin. I even found out how to get around the limitations on how many files you are allowed to have open at a time (with ulimit). However, it was incredibly slow; it used only about 6.0% CPU because writing the files was keeping it busy all the time. I think that thousands of files all growing at once is pretty much a pathological case for any filesystem, because both Reiser3 and XFS showed this behaviour. Caveat: when I tried this I was using 2048 bins because I was having trouble with getting my post-binning collection phase to handle larger files, but I doubt halving the number of bins would have made a huge difference. I tried buffering the bins. I even tried putting a pre-buffer before the buffers that fitted into L2 cache to make more efficient use of main memory, but all to no avail. It was going to take a day or two to generate all the data.

I asked for advice online, and the best suggestion I got was to preallocate the files before writing to them. Instead, I waited until I got into work the next morning to ask my colleague David, who takes a special interest in making disks behave efficiently. After some discussion, we came to a strategy that made far more efficient use of the disks.

We allocate 256 kB of memory to each bin, ie 256 MB in total for all 1024 bins. As soon as any one bin fills up, we write the entire 256 MB data structure to disk in one great sweep, empty all the bins, and continue. This makes slightly inefficient use of disk space, because not all the bins will be full. However, because the function is so close to being random, they are all very nearly full, and so the inefficiency is only around 2%; in other words, instead of making 512 writes we end up making 522. Now the generation phase is as disk-efficient as it can be – all it does is append giant chunks of data to a file at a time – and it takes only a couple of hours at 60% CPU usage.

Of course, the cost is that now there is a collection phase to read the contents of each bin. This phase takes 522 seeks to piece together the contents of each bin, and it has to be repeated 1024 times, once for each bin; the total cost is about an hour. This is in any case comparable to the cost of finding the duplicates in each bin. Since one process is disk-intensive and the other is memory-intensive, it makes sense to do them in parallel; after discussion with David the simplest parallelization turned out to be the humble Unix pipe. “./extractbins 203 binoutput | ./finddups” means that one can be pulling the data off the disk at the same time as the other is placing it into the hash table (which is slow because practically every write is a cache miss, and you have to read before you write introducing latency). Once the hash table is full, streaming through it to find the duplicates is pretty fast, around half a second.

This story doesn’t yet have a happy ending – hardware problems with my home machine mean that I can’t run the collection phase without crashing it, which I think means I need to buy some decent memory for it. But it was very cool to get a twelve-times speedup for the whole process out of half-an-hour’s discussion with a knowledgable colleague.