In an earlier post I mentioned that I was using CorePy for my cryptographic fiddlings.
Rather than writing the code in assember in the traditional way, I took advantage of CorePy to program directly against the x86 ISA in Python. In CorePy, machine instructions, registers and suchlike are first-class objects which can be composed to create executable code. The result is something like writing code to generate assembler input as text, but far more satisfying to work with, resulting in cleaner abstractions and code. We do not shell out to an assembler, but directly call the CorePy code object we generate. I probably would not have written this at all if I had not been inspired to try CorePy when our colleage Majek drew attention to it, and if I had I doubt it would have been finished within a few hours of starting as it was.
Well, I still think CorePy is very cool, but it turned out to introduce some tricky problems of its own, and by taking more care about how I use it I have sped up my program by nearly three orders of magnitude.
At first I thought that it was simply that my assembly was so fast that the Python excution time was dominating. In fact, Python is a little faster than I thought, and it was CorePy itself that was taking the bulk of the time. My inner loop executed in less than 1000 clock cycles, but the per-call overhead was far greater – more like 21,000 cycles. In addition to this per-call overhead, however, I had to reckon with a cost of nearly 1500 cycles for every 32-bit integer I loaded and stored from CorePy’s native arrays. This contrasts with a cost of less than 200 clock cycles to store one in a native Python list, or around 250 cycles for a NumPy array. If you have CorePy installed, you can see for yourself. So I had a very powerful machine, but I could only talk to it through a very narrow pipe.
My workaround can be found in the latest version of trivium-corepy. In addition to the assembly for computing Trivium itself, there’s now assembly specific to performing the cube attack. The code now takes a list of bit indexes to flip. Looping over these indicies it runs Trivium, XORs the output bits into a result area, and then flips the indexed bit in all 128 instances of Trivium inside the buffer before doing it all again. This means that each run lasts much longer, making the 21000 cycle overhead less significant. It’s still better to have more than one run, though, because of the cost of the writes needed to set up this index array – in fact, the sweet spot is reached when the total penalty from the writes to set up the index array is equal to the total penalty from the per-call overhead, because it is easy to double one by halving the other. So the indices are divided into three groups – those that will be handled in Python, those that will be handled by looping in assembly, and those that will be calculated in parallel by our 128 simultaneous instances of Trivium.
It’s with this technique that I’ve been able to find the problematic maxterms in the original Dinur and Shamir paper, and verify that the replacement maxterms I recently received from Dinur all work fine. I’ve even set the code on finding new maxterms, and found some for output bits up to 714, though the techniques I’m using are fairly basic (for example, I don’t yet take advantage of the fact that we get lots of output bits at once for free). Overall I think that I’m still glad of what CorePy gives me, but I think that it could be simpler still to write fast programs if those overheads could be significantly reduced.