A while ago I started dabbling in cryptography a bit, and inevitably, I ended up toying with performance of the related algorithms and code. In this post I’d like to share my approach to optimizing BouncyCastle’s implementation of Salsa20.
A few observations regarding Salsa20 (and presumably other modern ciphers) in the context of performance:
- Salsa20 uses 32bit integers, 32bit CPUs are dirt cheap nowadays, there is no reason to design an algo based on smaller words.
- It produces a pseudo-random stream in a block based fashion, 16 ints per block. A good opportunity to leverage multiple cores and multiple execution units in a core.
- It only uses basic, fast operations on ints, add, xor and shift.
As far as I could measure it, BouncyCastle’s Salsa20 implementation works out at around 20-22 CPU cycles / byte with latest JVMs. Fastest C implementations, according to DJB’s website can make it at around 4-5 CPU cycles / byte. A noticeable gap, let’s see why this is and what can be done with it.
Salsa20Engine.java from BouncyCastle sources can be found here. There are two hotspots in it.
First, salsaCore method, the actual implementation of the Salsa20 cipher. It produces a chunk of pseudo random ints and stores it in an internal buffer of bytes. One of the recent commits is an optimisation of this part of code. Buffering ints in variables as seen in the commit can potentially reduce a number of memory lookups in comparison to manipulating array elements as well as a number of boundary checks JVM has to perform.
Unfortunately, because algorithhm requires 16 such variables, JIT is likely to produce extra stores / loads to handle them all. Furthermore, underneath this still is good old serial code with no utilization of SIMD execution units. JITs in Java7 and Java8 can utilise SSE/AVX but are quite finicky about when and how they do it and code in Salsa20Engine.java doesn’t compel JIT to make use of SIMD instructions. As comment in the commit says, this optimization yields about 15%, and it is about so far we can go this way. This part of code nonetheless has a potential of yielding more with SIMD but it has to be approached from a different angle. The topic of SIMD use in JVM doesn’t seem to be well covered so I had to resort to experimentation and analysis of the JIT’s disassembly. To explain it all in proper detail would take way too much space to fit in a single post. So, I share a full optimized source code and hope it speaks for itself instead. The last, somewhat generic, note on it is that restructuring execution flow so that it uses SIMD entails extra data rearrangements which in turn take up extra CPU and reduce gains. This often is the case when we try to optimise a part of the picture without changing too much, or when we just cannot carry out deeper structural changes.
The second hotspot, in processBytes method, which implements an API entry, does xor’ing of an input stream of bytes with the sequence of bytes produced by salsaCore. Problem is, Salsa20 algorithm operates on 32-bit ints whereas API takes and outputs streams of bytes. As I mentioned before, Salsa20Engine.java converts ints produced by algorithm into a buffer of bytes which in turn is used to xor an input buffer of bytes. The xoring itself is done byte by byte and JIT in fact does produce code that processes it all in 8-bit chunks (including the costly loads / stores). A better approach is to keep the ints produced by algorithm as ints and use them as input bytes go, xor’ing input bytes with respective quarters of precalculated ints.
To test it all, FasterSalsa20Engine.java needs to be dropped alongside Salsa20Engine.java (into org.bouncycastle.crypto.engines package path), and SalsaTest.java needs to be compiled and run against this modified BouncyCastle.
On my laptop with a SandyBridge CPU and Java 1.8.0_11, an example output for larger blocks shows an average gain of 200-220%:
Salsa20 30.6kB :: min= 104mb/s avg= 109mb/s max= 111mb/s FasterSalsa20 30.6kB :: min= 221mb/s avg= 227mb/s max= 241mb/s Salsa20 86.5kB :: min= 99mb/s avg= 99mb/s max= 99mb/s FasterSalsa20 86.5kB :: min= 239mb/s avg= 239mb/s max= 239mb/s Salsa20 15.6kB :: min= 92mb/s avg= 92mb/s max= 92mb/s FasterSalsa20 15.6kB :: min= 231mb/s avg= 231mb/s max= 231mb/s Salsa20 72.4kB :: min= 93mb/s avg= 100mb/s max= 111mb/s FasterSalsa20 72.4kB :: min= 200mb/s avg= 207mb/s max= 221mb/s Salsa20 3.8kB :: min= 96mb/s avg= 97mb/s max= 98mb/s FasterSalsa20 3.8kB :: min= 140mb/s avg= 193mb/s max= 207mb/s