We had here, in LShift, this typical C-vs-Fortran discussion which prompted me to follow up on it. In this holy war I stand by C and believe that a lot of opinions supporting the alleged superiority of Fortran in numerical throughput come from poor understanding of what actually can be done on the C side. So, I’d like to demonstrate here a couple of loop optimization techniques that can be used in C in order to maximize numerical throughput. Why loops? Because they often constitute the bulk of computations and have significant potential for utilizing architecture of modern CPUs. And if your code doesn’t spin in such loops, you should strive to make it do so, I hope, I will provide in this post convincing arguments as to why.
I use my laptop with a SandyBridge CPU, Core i5, running at 1.7GHz, and as a compiler, GCC 4.8. As a task, I’ve picked sum of squares of N vectors:
y[i] = x1[i]^2 + x2[i]^2 + ... + xN[i]^2
It nicely fits the bill, sum of squares is a commonplace in numerical computations, uses addition and multiplication allowing to demonstrate mechanisms available in modern CPUs, and has the desired structure of:
for i ... for j ... do stuff here
I use snippets of a bit simplified C here, and although all tested code is pure C, I also use chunks of actual assembly produced by GCC to illustrate arising issues. The assembly snippets are for SSE target as GCC tends to produce cleaner code for SSE than for AVX.
Ok, down to the business, our naive loop summing squares of N vectors could be:
for(i = 0; i < VECTOR_LEN; i++) // mulss %xmm0, %xmm0 for(j = 0; j < N; j++) // addss %xmm0, %xmm1 acc[i] += v[j][i] * v[j][i]; // movss %xmm1, (%rdi)
Disassembly reveals that XXXss instructions are used. This subset of SSE instructions is using one floating point word, not 4, in one go. This code is clearly not taking full advantage of SIMD units and still limits the throughput to 1xCPU_CLOCK. Since logically, it doesn’t matter which loop is inner, which outer, we can swap them while the algorithm remains valid. Now we have the “horizontal” loop:
for(j = 0; j < N; j++) // mulps %xmm0, %xmm0 for(i = 0; i < VECTOR_LEN; i++) // addps %xmm0, %xmm1 acc[i] += v[j][i] * v[j][i]; // movups %xmm1, (%rdi, %rax)
Boom! This time, GCC unrolled the inner loop using full-width XXXps SSE instructions. This single change boosts performance to the expected SIMD_WIDTHxCPU_CLOCK benchmark, as it will be shown below. Too bad that GCC cannot automatically do this simple optimization for us, but as far as I remember, ICC can. Moving on, the next logical step would be to unroll calculations “vertically”, it should reduce number of memory reads/writes. An example of thus manually unrolled loop:
for(j = 0; j <= N-4; j+=4) // movups (%rdi, %r9), %xmm0 for(i = 0; i < VECTOR_LEN; i++) // mulps %xmm1, %xmm1 acc[i] += v[j+0][i] * v[j+0][i]; // addps %xmm1, %xmm0 acc[i] += v[j+1][i] * v[j+1][i]; // movups %xmm0, (%rdi, %r9) <== redundant write acc[i] += v[j+2][i] * v[j+2][i]; // movups (%rax, %r9), %xmm1 acc[i] += v[j+3][i] * v[j+3][i]; // mulps %xmm1, %xmm1 // addps %xmm0, %xmm1 // movups %xmm1, (%rdi, %r9) <== redundant write
Here, we see the infamous effect of pointer aliasing every so often brought up in C-vs-Fortran discussions. Compiler, for each line of calculations produces extra read / write instructions which defeat the intended purpose of vertical unrolling. Luckily, the solution is trivial, an extra variable in the inner loop, this makes compiler produce code which caches calculations in a register. Here is the “cached” loop:
for(j = 0; j <= N-4; j+=4) // movups (%rcx, %r9), %xmm1 <== single reads for(i = 0; i < VECTOR_LEN; i++) // movups (%r8, %r9), %xmm0 float tmp = acc[i]; // mulps %xmm1, %xmm1 <== bulk calculations tmp += v[j+0][i] * v[j+0][i]; // addps %xmm4, %xmm3 tmp += v[j+1][i] * v[j+1][i]; // mulps %xmm0, %xmm0 tmp += v[j+2][i] * v[j+2][i]; // addps %xmm3, %xmm2 tmp += v[j+3][i] * v[j+3][i]; // addps %xmm2, %xmm1 acc[i] = tmp; // addps %xmm1, %xmm0 // movups %xmm0, (%rdi, %r9) <== single write
Now the block of resultant SSE operations is compact and doesn’t have the redundant memory accesses. The last optimization I’d like to introduce further leverages the capability of the modern CPU to parallelize independent streams of operations. In order to do this, we need to break dependency chains, in other words, split calculations into independent sequences being executed on separate registers and execution units, here is our “final” loop:
for(j = 0; j <= N-8; j+=8) for(i = 0; i < VECTOR_LEN; i++) float tmp1 = acc[i]; float tmp2 = v[j+0][i] * v[j+0][i]; float tmp3 = v[j+1][i] * v[j+1][i]; tmp1 += v[j+2][i] * v[j+2][i]; tmp2 += v[j+3][i] * v[j+3][i]; tmp3 += v[j+4][i] * v[j+4][i]; tmp1 += v[j+5][i] * v[j+5][i]; tmp2 += v[j+6][i] * v[j+6][i]; tmp3 += v[j+7][i] * v[j+7][i]; acc[i] = tmp1 + tmp2 + tmp3;
C code I used for testing all the above loops, is here. To rule out memory bandwidth issues as much as it was possible, I run tests for a bunch of vectors small enough to fit into L1 cache. Throughputs for single core:
SSE AVX naive: 1733.4 MFLOPS 1696.6 MFLOPS // 1xCPU_CLOCK barrier for scalar instructions horizontal: 5963.6 MFLOPS 9419.8 MFLOPS // 4xCPU_CLOCK and 8xCPU_CLOCK for SSE and AVX unrolled: 11264.8 MFLOPS 11496.6 MFLOPS cached: 14253.7 MFLOPS 15086.5 MFLOPS final: 17985.4 MFLOPS 18210.4 MFLOPS // Both, SSE and AVX settle at around 10xCPU_CLOCK
So it seems, this midrange laptop CPU could potentially get us some 35 GFLOPS out of its two cores without resorting to nothing more than simple changes in pure C.
Things to consider:
- Why for SSE, we did manage to get throughput of 10xCPU_CLOCK even though SSE operates on 4-floats chunks? SandyBridge architecture has separate execution units for addition and multiplication capable of performing operations in full parallel, this effectively means, SandyBridge can perform as if it had fused add-mul, upping, in some situations the theoretical limit to 8xCPU_CLOCK for SSE and 16xCPU_CLOCK for AVX.
- Why then, we have 10xCPU_CLOCK not 8xCPU_CLOCK for SSE? SandyBridge CPUs have TurboBoost feature which provides extra headroom under certain circumstances, TurboBoost, however, may be very limited, especially when you properly harness all cores of your CPU.
- Why then, we didn’t get more than 10xCPU_CLOCK and why we hit the same wall for AVX? We hit an L1 memory bandwidth bottleneck, further memory access optimizations are needed, and this is when carefully handcrafted assembly code may squeeze even more out of CPU.
- How generic / reliable these optimization techniques are? Well, it’s a product of code, compiler and CPU, so your mileage may vary. But if you really wanna get real kick for your buck, you will have to research and experiment with your own devices.
- Last but not least, keep in mind that such changes are likely to produce slightly different outputs due to floating-point rounding.