For the last couple of months I’ve been working on rewriting RabbitMQ’s persister so that it will scale to volumes of data that won’t fit in RAM, and will perform consistently across a wide variety of use cases. This work is coming to a conclusion now, and although the code is not yet released, nor has it even been through QA, benchmarking it thoroughly is useful to allow us to understand what’s good and what’s bad about the new design. In this post I’m not going to do any before-and-after comparisons; they’ll be coming in due course. Instead, I’m going to use RabbitMQ to benchmark harddiscs: an SSD, and a normal rotating harddisc. As someone said at the presentation we gave at the recent Erlang Factory, “using SSDs are just like RAM”. Cue expectations of a turbo-charged, overclocked, overvolted Rabbit, with liquid nitrogen cooling.
So, I arrived at my desk on Monday morning to find Father Christmas had woken early from his year-long hangover, and dropped off an OCZ Vertex SSD 60GB. Now, I have read the massive article on Anandtech about how SSDs work, how most of them except the Intel ones have awful write performance for anything except sequential writes, and how the new OCZ Vertex range has changed that: pretty good performance, and usefully not as cripplingly expensive as the Intel SSDs.
I pretty much spent Monday getting its firmware upgraded (ugh, Windows, DOS etc.), getting it all set up, and just hammering the hell out of it: playing with lots of different filesystems (ext2, ext3, ext4, xfs, btrfs) and dialing everything to 11. But by the end of the day, I had a feeling that not all was well. It seemed to be going slower than it had initially, and I couldn’t really get it to go that much faster. Overnight, I remembered that yes, SSD performance does degrade once all the empty sectors of the disk have been written to once, because to fill in the remaining gaps, an entire block has to be erased and rewritten. So the degradation was to be expected. I thought, OK, I’ve now got to a stage where the drive seems well worn in, so let’s benchmark it now.
So, the SSD is 60GB, with a 64MB cache, running OCZ’s 1.30 firmware. The spinning disk is a Western Digital Caviar SE16, 320GB with a 16MB cache (WD3200AAKS). Both disks are formatted using ext3, with exactly the same options, and mounted with data=ordered.
Test 1: Start Rabbit, create a queue, set the queue to disk-only mode (this is a new feature), and send in 3 million 1024-byte messages. For measurement, I’m taking the time in microseconds since the epoch and running iostat, roughly every 0.4 seconds. It doesn’t matter if the intervals between calls to iostat aren’t totally even, because I’m capturing the timestamp too.
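The sampling loop is simple enough to sketch. The following is a hypothetical reconstruction, not the actual harness; the device name, log path and sample count are assumptions:

```python
import subprocess
import time

DEVICE = "sda"    # assumption: the block device under test
INTERVAL = 0.4    # seconds between samples, roughly

def sample(cmd=("iostat", "-d", "-k", DEVICE)):
    """Capture a microseconds-since-epoch timestamp alongside one
    iostat snapshot. Because every sample carries its own timestamp,
    jitter in the polling interval doesn't skew the results."""
    now_us = int(time.time() * 1_000_000)
    stats = subprocess.run(list(cmd), capture_output=True, text=True).stdout
    return now_us, stats

def run(log_path="iostat.log", samples=100):
    with open(log_path, "w") as log:
        for _ in range(samples):
            now_us, stats = sample()
            log.write("%d\n%s\n" % (now_us, stats))
            time.sleep(INTERVAL)
```

Pairing each iostat snapshot with its own timestamp is what makes the uneven polling interval harmless: throughput is computed from deltas between samples, not from an assumed fixed period.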
I’ve divided the 3 million messages into 3 runs of 1 million each, but the queue isn’t emptied between runs; i.e. by the end, the queue really did contain 3 million messages.
What can we see here? Well, the spinning disk is just faster: it gets to the end of each million at least 10 seconds sooner. Is either disk saturated? No. The writes are pretty simple: the message content itself gets appended to plain files, so there is almost no seeking going on there. However, we use Mnesia to maintain an index into these files. Mnesia is running our table in disc_copies mode, so from time to time it’ll decide to dump it out to disk. That will hit a different part of the disk and cause some seeking, but it really should just be another large bulk write. Also note that I’m plotting writes only; there were no reads going on at all during this test. CPU load is fairly high, but quite a lot of the time XOSView shows that processes are stalled waiting on IO to complete. So you think that one doesn’t look too bad? We know that SSDs have traditionally been optimised for sequential performance at the expense of random access and latency. So let’s go from 3,000,000 1024-byte messages to 300,000 10,240-byte messages. This should suit the SSD better, right?
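The layout described above — bodies appended to a flat file, plus a separate index — can be modelled in a few lines. This is a toy sketch for illustration only; RabbitMQ’s real store is Erlang and considerably more involved:

```python
import os

class AppendOnlyStore:
    """Toy model of the layout described above: message bodies are
    appended to a plain file (purely sequential writes, so no seeking
    on the write path), while an in-memory dict stands in for the
    Mnesia index, mapping message id -> (offset, length)."""

    def __init__(self, path):
        self.index = {}
        self.f = open(path, "ab+")

    def put(self, msg_id, body):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(body)                  # sequential append only
        self.f.flush()
        self.index[msg_id] = (offset, len(body))

    def get(self, msg_id):
        offset, length = self.index[msg_id]
        self.f.seek(offset)                 # reads may seek; writes never do
        return self.f.read(length)
```

The point of the split is visible even in the toy: the message data itself generates nothing but sequential appends, and it is only the index (the Mnesia table in Rabbit’s case) whose periodic dumps land elsewhere on the disk.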
Wrong! It’s even worse, and I promise you I’ve not got the two sets of data reversed!
Some of you may be wondering why each run seems to write a different amount of data to disk. That puzzled me too. Our best guess is that it’s affected by the filesystem coalescing writes (e.g. of metadata), and by the interaction of the write barriers there with the dumps coming from Mnesia. Please let us know if you have further ideas!
Finally, one thing we can all agree on is that SSDs have awesome read performance. Running the venerable dd shows that the SSD can happily sit there reading at 150MB/s, so it really should outperform the spinning disk, right? The test, then, is to set up an auto-acking consumer and read out those 3 million 1024-byte messages. Here, I used free, dd, and /dev/zero to make sure that before starting, the OS did not have any of the files Rabbit would need to read sitting in its caches. Also, as the deliveries occur, there will be writes, because we have to update the Mnesia tables to indicate that the messages have been delivered (and ack’d).
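For the curious, that cache-flushing trick amounts to streaming zeros into a scratch file bigger than RAM, which forces the kernel to evict other cached pages, and then eyeballing the cached column of free to confirm. A hedged sketch, in which the path and size are assumptions:

```python
import os
import subprocess

def evict_page_cache(scratch="/tmp/evict.tmp", megabytes=8192):
    """Write `megabytes` of zeros to a scratch file via dd. Sized
    larger than RAM, this pushes previously cached file data out of
    the OS page cache (compare `free -m` before and after).
    Returns the number of bytes written."""
    subprocess.run(
        ["dd", "if=/dev/zero", "of=" + scratch,
         "bs=1M", "count=%d" % megabytes],
        check=True, stderr=subprocess.DEVNULL)
    written = os.path.getsize(scratch)
    os.remove(scratch)    # the file itself isn't needed, only the eviction
    return written
```

A crude approach compared with, say, /proc/sys/vm/drop_caches, but it needs no root and works on any Linux box with dd installed.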
So here, we find the SSD and spinning disk are basically matched for performance. A result! The best thing about this graph is how easy it is to see the size of the Mnesia table: it starts off with 3 million rows, so when it is dumped, the “step” in the writes is quite large; as the table gets smaller and smaller, the dumps get smaller too. Brilliant!
Some quick maths is even more exciting: 3 million 1024-byte messages suggest that we should read, well, about 3GB. We actually seem to read a bit more, but there are some fixed overheads in the file format (length prefixes, trailing status bytes, etc.), so the amount of data read seems entirely plausible. What’s a little surprising is that in the course of reading 3GB, we actually write out nearly 6GB in updates to the Mnesia table. Of course, this won’t amount to 6GB of disk space, because we’re constantly rewriting the same table, but nevertheless, it is rather eye-opening.
Looking back at the writing graphs, we see a similar story: the raw data being written is about double the amount of message data. Each trace amounts to about 1GB (i.e. 1 million 1024-byte messages or 100,000 10,240-byte messages), and yet we see between 2GB and 2.5GB of data actually being sent to disk. This is, if not alarming, then certainly somewhat eye-opening.
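The back-of-envelope arithmetic for both directions is worth writing down explicitly (the figures are the ones reported above; the 2.25GB is just the midpoint of the observed 2–2.5GB range):

```python
GB = 10**9

# Read test: 3 million 1024-byte bodies...
messages, payload = 3_000_000, 1024
raw = messages * payload
print(raw / GB)                  # ~3.07 GB of message bodies read

# ...while roughly 6 GB was written in Mnesia index updates meanwhile.
writes_during_read = 6 * GB
print(writes_during_read / raw)  # roughly 2x the payload volume

# Write tests: each ~1 GB trace of message data became 2-2.5 GB on disk.
trace, observed = 1 * GB, 2.25 * GB
print(observed / trace)          # roughly 2.25x write amplification
```

Either way you slice it, the bytes actually hitting the disk are about double the message payload, which is why neither drive’s headline sequential figures translate into message throughput.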
The bottom line, however, is no, SSDs are not just like RAM, and certainly as far as Rabbit is concerned, for high-throughput operation they are nowhere near viable as replacements for spinning disks. Write latency may have improved over the initial models, and random access performance may have improved too, but at the end of the day, for our particular patterns of reads and writes, the spinning disks still win.