Google’s latest open-source initiative: Protocol Buffers. Like XML, this is a generic language-neutral hierarchical container format, but unlike XML it is a binary format designed for low size overhead and fast reading and writing; in this it resembles some other initiatives to that end, including Facebook’s Thrift. I am keen to see a good binary alternative to XML become popular and there are several clever and attractive aspects to PB’s design, but there are several shortcomings in the wire format, and I don’t think it’s hard to design a better alternative.
A PB file consists of a sequence of numbered fields; applications that want to use PB write protocol description (“.proto”) files describing what these fields mean. One of the goals of PB is that old consumers should be able to read the data produced by newer producers, by simply skipping unexpected fields. This means in turn that consumers have to be able to find the end of a field without needing any out-of-band metadata about the field. Take a look at the wire format documentation to see how PB achieves that – seven-bit integer encoding, ZigZag, wire types, and embedded messages. Each of these is a sacrifice in both cleanliness and performance.
The “varint” encoding, that writes integers in seven bit groups with the high bit as a stop bit, is inefficient both for reading and writing; you can’t just send the bytes you have to the wire, you have to measure them out and check when you’ve finished with shifts; each check can also cause pipeline stalls. ZigZag is even worse – it shifts the whole number up in order to put the sign bit in the least significant bit, so there’s more work to do once the number is assembled. Wire types are a nasty wart; they’re not enough information to actually know what is being sent, and they are there only so that the length of what is being sent can be inferred properly where a field is unknown (and to work around the inefficiency of the “varint” encoding by providing special fixed-length integer encodings). And the only non-deprecated way of creating hierarchy is to embed a message as a string – which means that every data structure below the top level must be completely marshalled in memory before it is sent, so that its length can be measured.
So here’s my alternative proposal – same logical structure, different wire format. We encode the data as a stream of “atoms”. An atom is either a hunk of binary data, or a “special symbol”. Here’s how we decode an atom:
def readIntOrLen(lim, stream, i): if i < lim: return i return toUint(stream.read(1 + i - lim)) # Could force eg one-byte encoding to cover range [lim, lim+255] # instead of [0, 255] here but is it worth it? def readAtom(stream): firstByte = toUint(stream.read(1)) fieldId = readIntOrLen(12, stream, firstByte >> 4) if fieldId == 0: return specialSymbol(firstByte & 0x0f) dataLen = readIntOrLen(8, stream, firstByte & 0x0f) return fieldId, stream.read(dataLen)
So we make the head byte do as much work as possible; it has one nybble for the field ID, one nybble for the data length, and no “wire type”. If the nybble is too small, we overflow into the rest of the data stream, using as few bytes as possible but encoding in a natural 8-bit little-endian format. For structures that have fewer than 12 members, this adds only one byte of overhead on any field which is fewer than 8 bytes long. Right away, that’s 7-bit-footed varints, ZigZag, and wire types dispensed with. We’ve reserved fieldId 0 for special symbols; we can encode 16 different symbols.
To avoid the pre-marshalling problem, we re-introduce the grouping symbols that Google deprecated. A “start group” symbol should be followed by a field ID; this will normally be written in the following byte, but values above 251 will take up more bytes: in other words we read the group ID with
readIntOrLen(252, stream, toUint(stream.read(1))). This is then paired with an “end symbol”. Obviously these symbols act like brackets in SPKI and can be nested. So you can start writing a nested data structure just by emitting the start symbol, marshalling the structure directly to the output stream, and then emitting the end symbol; no pre-marshalling needed.
This format fixes the four problems I identified with the format, but we still have 14 special symbols left, and I see nice things to do with them.
Let specialSymbol(0) be NOP. Note that this is just the same as a zero byte, so now runs of zero bytes inbetween proper data has no effect. On disk, you can “zero out” bits of data you don’t need without having to copy up all the other data to take its place; or you can leave zero blocks in useful places (eg at the head of a file) where you might want to write real data later.
Let’s have “begin chunked” and “end chunked” symbols. All atoms in between should be hunks of binary data with the same field ID; the symbols mean these should be concatenated to get the final answer. Now you can start writing even an individual binary field before you know how long it’s going to be, useful if you’re generating it on the fly.
Let’s have a “begin metadata” symbol, which should be followed by a group; the group ID can identify what sort of metadata. Parsers must ignore metadata they don’t understand, but must fail if they see special symbols they don’t understand. We can put for example the protocol description itself in the metadata, suitably encoded, so that smart clients can use it eg to display the data meaningfully.
Let’s have a “version” symbol, which is followed by a version byte. If the version is unknown, the parser must fail. We should choose these symbols carefully, so that the first two bytes of the file can be an appropriate magic number not used by other formats.
We can trivially define a “canonical encoding” for signing: it is precisely the shortest encoding for the data given, with fields in sorted order by fieldID.
This seems to me to be nearly always better than Google’s proposal – are there angles I’ve missed?
Update: I wasn’t happy with the two-byte overhead for eight-byte fields; here’s an encoding that has a one-byte overhead for fields of 12 bytes or fewer.
def readIntOrLen(lim, stream, i): if i < lim: return i return toUint(stream.read(1 << (i - lim))) # Could force eg one-byte encoding to cover range [lim, lim+255] # instead of [0, 255] here but is it worth it? def readAtom(stream): firstByte = toUint(stream.read(1)) fieldId = readIntOrLen(13, stream, firstByte >> 4) if fieldId == 0: return specialSymbol(firstByte & 0x0f) dataLen = readIntOrLen(13, stream, firstByte & 0x0f) return fieldId, stream.read(dataLen)
This limits length descriptors to being 0, 1, 2, or 4 bytes long. Note that this limits individual chunks to 4GB. If you’re sending disk images, use the chunked encoding; the canonical encoding rules will need a special case for this.