Yet another Key-Value database

By: on August 21, 2009
Following the NoSQL movement, I became a fan of key-value databases. Usually there’s nothing interesting to say as they work fine out-of-the-box. But in a project I was recently working on K-V store started to be a major bottleneck.

I must note that my use case is pretty specific. In a K-V database I store about ten million records. The keys are pretty small, about 12 bytes on average. The probability of hitting every single key is more or less the same, which means that adding a cache layer will not make things go faster. The values are quite small, up to few kilobytes. The access pattern is also not common: I only do record updates. This requires reading a value and writing a new one few bytes larger. So, unlike normal use cases, the number of writes is substantial.
I experimented with Berkeley DB and Tokyo Cabinet, but I noticed that both engines do similar things internally. Let’s take a look at what happens there.
Berkeley DB
In my experiments K-V databases were I/O bound. To see what’s happening inside I used strace. In this strace log it’s clearly visible that BDB is doing random reads and writes all over the file (the last parameter to pread/pwrite syscalls is the file offset):
pread64 (3, ..., 4096, 32632832)
pwrite64(3, ..., 4096, 35389440)
pread64 (3, ..., 4096, 25407488)
pwrite64(3, ..., 4096, 43790336)
pread64 (3, ..., 4096, 23085056)
pwrite64(3, ..., 4096, 43831296)
pread64 (3, ..., 4096, 28508160)
pwrite64(3, ..., 4096, 10280960)
Doing such a random writes is not a good idea; the system quickly gets stuck on synchronizing dirty pages to disk (this is the output of strace -c):
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 87.58    0.013548           1      9228           pwrite64
  0.98    0.000152           0      8475           pread64
Strace reveals similar pattern in TokyoCabinet:
pread64 (3, ..., 48,   108776352)
pwrite64(3, ..., 5,    108772688)
pwrite64(3, ..., 4224, 131860800)
pread64 (3, ..., 48,   81600048)
pwrite64(3, ..., 5,    81600048)
pwrite64(3, ..., 432,  126273472)
pread64 (3, ..., 48,   96069904)
pwrite64(3, ..., 5,    96066128)
pwrite64(3, ..., 4256, 133663632)
 % time     seconds  usecs/call     calls    errors syscall
 ------ ----------- ----------- --------- --------- ----------------
  93.76    0.010568           1      9296           pwrite64
   1.90    0.000214           0     22891           pread64
TokyoCabinet can mmap(2) the whole file and do no read/write syscalls at all. That gives a boost, but dirty pages will eventually need to be written to the disk. I suspect it is done in a fairly unpredictable order.
YDB, Why not?
Both databases try to reuse disk space from deleted records. They overwrite places that are no longer used and we see writes with some random offsets. In my opinion it’s strange: I have a few-hundred-gigs hard drive and I don’t really care how much disk space is wasted. But I really do care about every disk-seek. What if in a database we’d only appended data and never modified already written records? This way we could save half of disk seeks on random writes and do only a sequential append once in a while. Writes will on average cost nothing, but we’d pay the cost of disk space as it (almost) never is reclaimed. That’s the idea behind my experimental, yet another, K-V database: YDB.
Like previous databases, let’s show YDB in action:
pread64(4, ..., 68, 31661472)
write  (5, ..., 104)
write  (5, ..., 96)
pread64(4, ..., 60, 19781776)
write  (5, ..., 100)
pread64(4, ..., 60, 52593732)
write  (5, ..., 104)
As we’re doing writes on an append-only file descriptor, we don’t need to set the write offset and we can avoid using pwrite(2). On the other hand, YDB doesn’t do any caching, so it’s flooding kernel with a lot of tiny write syscalls, way more than it should. But with an append-only file descriptor the writes are really cheap. As opposed to TC/BDB they aren’t the most expensive syscall:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
31.60    0.000963           0     24520           write
2.23    0.000068           0      4944           pread64
Features and anti-features
The design I chose has some advantages and disadvantages:
  • YDB is append-only key-value database (aka: log-like or journaled).
  • The index is in RAM, so retrieving any item costs at most 1 disk seek.
  • The index is in RAM, so all key values must fit into RAM. This means that on 32 bit systems the index can’t be larger than about 1.5 gigabytes.
  • YDB is not a general-purpose database, it will fall behind other K-V stores in most of the benchmarks. Especially in disk usage, starting time and memory footprint.
  • YDB is designed for spinning disks, but it still can be faster than competition on cheap SSD drives. This is due to the fact that random writes are very expensive on broken SSD’s, while sequential writes are extremely fast.
  • There is a simple Garbage Collector that periodically crawls the logs. It appends the data from old logs to recent one and removes old, no longer used logs.
  • When GC is running you should expect YDB to run about two times slower for a while.
  • Even though we have GC, YDB shouldn’t run in an environment with limited disk space. Assume at least 5-8 times more disk space than you need.
  • This means that you store only about 100GB of usable data on 800GB hard drive. This is how it works.
  • YDB will be considerably faster only if you have a lot of relatively small keys and do a lot of writes. Otherwise TC/BDB can be good enough.
Works for me
Let’s try to emulate my use case. Here are some technical details about a test:
  • I start with an empty database.
  • I do a lot of small updates and end up in having 2MLN of items.
  • Total size of keys and metadata is about 100MB.
  • Average value is 142 bytes long.
Output database sizes:
  • TC 260MiB
  • BDB 320MiB
  • YDB 492MiB
I did the tests on my quad core Linux machine with 4 gigs of RAM and some generic spinning disk inside.  BDB was configured to use 96MiB of cache, TC had some pretty high caching options (cache value of 12MLN, xmsiz value of 750MB). These settings were taken from these benchmarks.
Finally, my benchmark took (lower is better):
  • TC: 610 seconds
  • BDB: 1043 seconds
  • YDB:  271 seconds
Second benchmark
YDB is terrible in most of common benchmarks. It wastes a lot of disk space, it starts slowly. On the other hand, when you have a lot of random writes YDB behaves well. I included simple benchmark software in YDB sources. Feel free to check it yourself!
 $ cd ydb/benchmarks
 $ make
 Running benchmark: more_than_ram2
    tc: 569204ms
   bdb: 816312ms
   ydb:  99022ms
This benchmark, simply sets 16K keys to some random values which vary in size between 4KiB to 20KiB. This is repeated 16 times. In this particular benchmark, after flushing disk caches, YDB was about 5 times faster than TC and 8 times faster than BDB.
Some reasons you might want to try YDB:
  • You are doing a lot of writes to K-V store.
  • Your database size is substantial.
  • You’re running out of disk seeks.
So, if you have too much free disk space and your K-V database is I/O stuck, give YDB a try! Why not?



  1. MikeZ says:

    When comparing against Berkeley DB, what database type did you use?

    The default is btree, which spends IO updating pages to maintain records in sorted order. This is important for retaining locality of reference for cases where you want to iterate over a bunch of adjacent records using a cursor, but as you see, that comes with some cost.

    However, Berkeley DB provides other database types:

    HASH – for faster inserts without maintaining locality of reference, while retaining the ability to quickly search for an arbitrary key.

    RECNO – for append-only inserts that generate keys (record numbers) that can be used to randomly access fixed-size records.

    QUEUE – for FIFO appends and reads of fixed-size records with automatic db file extension and removal.

    For a fair comparison with Berkeley DB you should look at these other database types. The HASH db type may be what you are looking for, especially if you have variable-length records, but if you find the cost of keeping the on-disk structure too high and you can reduce your reliability requirements, you might look into RECNO or QUEUE with an in-memory index mapping keys to record numbers.

  2. Paul Crowley says:

    I’d be interested to see a comparison with BDB-JE, which is a pure log-structured database.

  3. MikeZ says:

    Whoops! I see the benchmark you mentioned DOES cover the BDB HASH type, good! So I wonder why Tokyo Cabinet is so much cheaper …

  4. Adam Ierymenko says:

    Tokyo Cabinet has caching settings, but the default is not to cache so as to ensure that data is written immediately to the DB in case your program crashes. This is often what you want if you’re not a speed demon.

    If you turn on caching, Tokyo Cabinet runs even faster and shouldn’t bang on the disk as much.

  5. marek says:

    @Adam Ierymenko

    In the benchmarks I did TC caching setting was set to 12000000. See the source here.

  6. What you’ve implemented reminds me of postgresql. Without vacuuum it only writes new rows. You can turn of fsync if you don’t care for it. I’m not sure if the log file writes will screw your seeks, but you could always put them on another drive (ram disk).

    The main problem is that it’s non-no-sql. But some people like that.

  7. Interesting post – I use strace all the time, but I never knew about the -c argument until reading your post. 🙂

  8. mehmoomoo says:

    100-200M is “substantial” ? Huh?

    That small databases you should just cache completely in memory, and write just the transaction logs for changes.

  9. Chase Saunders says:

    Check out RethinkDB –

  10. marek says:

    I know 200MB is not really “substantial”, but it’s enough to kill the databases with disk I/O.

    Yes, databases of such sizes should be kept in-memory only, and all the changes should be kept in transaction logs. This is exactly what YDB will do if your data fits into RAM. But this is not what BDB/TC are doing.

  11. Neil says:

    The probability of hitting every single key is more or less the same, which means that adding a cache layer will not make things go faster.

    Surely, in this case, if you have a cache that caches 20% of your data, then your cache hit rate will be 20%, and you’ll see a proportional speed-up. So things will definitely get faster, just not as much as if you had an access pattern that was skewed.

  12. Paul Crowley says:

    @Neil: The OS-level disk caching should suffice to bring that speedup, shouldn’t it?

  13. dyu says:

    This seems quite a lot like bitcask … with this being implemented first.
    Regarding startup, I believe this could be optimized using hint files (like what bitcask is doing). See

  14. Daniel Waterworth says:

    Hi, I wondered if you’d be interested in aodbm – Append Only Database Manager (see )? It uses a on-disk B+Tree.

Leave a Reply

Your email address will not be published.

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>