Optimizing loops in C for higher numerical throughput and for fun

By: on November 27, 2013

We had here, in LShift, this typical C-vs-Fortran discussion which prompted me to follow up on it. In this holy war I stand by C and believe that a lot of opinions supporting the alleged superiority of Fortran in numerical throughput come from poor understanding of what actually can be done on the C side. So, I’d like to demonstrate here a couple of loop optimization techniques that can be used in C in order to maximize numerical throughput. Why loops? Because they often constitute the bulk of computations and have significant potential for utilizing architecture of modern CPUs. And if your code doesn’t spin in such loops, you should strive to make it do so, I hope, I will provide in this post convincing arguments as to why.

I use my laptop with a SandyBridge CPU, Core i5, running at 1.7GHz, and as a compiler, GCC 4.8. As a task, I’ve picked sum of squares of N vectors:

    y[i] = x1[i]^2 + x2[i]^2 + ... + xN[i]^2

It nicely fits the bill, sum of squares is a commonplace in numerical computations, uses addition and multiplication allowing to demonstrate mechanisms available in modern CPUs, and has the desired structure of:

    for i ...
        for j ...
            do stuff here

I use snippets of a bit simplified C here, and although all tested code is pure C, I also use chunks of actual assembly produced by GCC to illustrate arising issues. The assembly snippets are for SSE target as GCC tends to produce cleaner code for SSE than for AVX.

Ok, down to the business, our naive loop summing squares of N vectors could be:

    for(i = 0; i < VECTOR_LEN; i++)        // mulss   %xmm0, %xmm0
        for(j = 0; j < N; j++)             // addss   %xmm0, %xmm1
            acc[i] += v[j][i] * v[j][i];   // movss   %xmm1, (%rdi)

Disassembly reveals that XXXss instructions are used. This subset of SSE instructions is using one floating point word, not 4, in one go. This code is clearly not taking full advantage of SIMD units and still limits the throughput to 1xCPU_CLOCK. Since logically, it doesn’t matter which loop is inner, which outer, we can swap them while the algorithm remains valid. Now we have the “horizontal” loop:

    for(j = 0; j < N; j++)                  // mulps   %xmm0, %xmm0
        for(i = 0; i < VECTOR_LEN; i++)     // addps   %xmm0, %xmm1
            acc[i] += v[j][i] * v[j][i];    // movups  %xmm1, (%rdi, %rax)

Boom! This time, GCC unrolled the inner loop using full-width XXXps SSE instructions. This single change boosts performance to the expected SIMD_WIDTHxCPU_CLOCK benchmark, as it will be shown below. Too bad that GCC cannot automatically do this simple optimization for us, but as far as I remember, ICC can. Moving on, the next logical step would be to unroll calculations “vertically”, it should reduce number of memory reads/writes. An example of thus manually unrolled loop:

    for(j = 0; j <= N-4; j+=4)                 // movups  (%rdi, %r9), %xmm0
        for(i = 0; i < VECTOR_LEN; i++)        // mulps   %xmm1, %xmm1
            acc[i] += v[j+0][i] * v[j+0][i];   // addps   %xmm1, %xmm0
            acc[i] += v[j+1][i] * v[j+1][i];   // movups  %xmm0, (%rdi, %r9)   <== redundant write
            acc[i] += v[j+2][i] * v[j+2][i];   // movups  (%rax, %r9), %xmm1
            acc[i] += v[j+3][i] * v[j+3][i];   // mulps   %xmm1, %xmm1
                                               // addps   %xmm0, %xmm1
                                               // movups  %xmm1, (%rdi, %r9)   <== redundant write

Here, we see the infamous effect of pointer aliasing every so often brought up in C-vs-Fortran discussions. Compiler, for each line of calculations produces extra read / write instructions which defeat the intended purpose of vertical unrolling. Luckily, the solution is trivial, an extra variable in the inner loop, this makes compiler produce code which caches calculations in a register. Here is the “cached” loop:

    for(j = 0; j <= N-4; j+=4)             // movups  (%rcx, %r9), %xmm1   <== single reads
        for(i = 0; i < VECTOR_LEN; i++)    // movups  (%r8, %r9), %xmm0
            float tmp = acc[i];            // mulps   %xmm1, %xmm1         <== bulk calculations
            tmp += v[j+0][i] * v[j+0][i];  // addps   %xmm4, %xmm3
            tmp += v[j+1][i] * v[j+1][i];  // mulps   %xmm0, %xmm0
            tmp += v[j+2][i] * v[j+2][i];  // addps   %xmm3, %xmm2
            tmp += v[j+3][i] * v[j+3][i];  // addps   %xmm2, %xmm1
            acc[i] = tmp;                  // addps   %xmm1, %xmm0
                                           // movups  %xmm0, (%rdi, %r9)   <== single write

Now the block of resultant SSE operations is compact and doesn’t have the redundant memory accesses. The last optimization I’d like to introduce further leverages the capability of the modern CPU to parallelize independent streams of operations. In order to do this, we need to break dependency chains, in other words, split calculations into independent sequences being executed on separate registers and execution units, here is our “final” loop:

    for(j = 0; j <= N-8; j+=8)
        for(i = 0; i < VECTOR_LEN; i++)
            float tmp1 = acc[i];
            float tmp2 = v[j+0][i] * v[j+0][i];
            float tmp3 = v[j+1][i] * v[j+1][i];
            tmp1      += v[j+2][i] * v[j+2][i];
            tmp2      += v[j+3][i] * v[j+3][i];
            tmp3      += v[j+4][i] * v[j+4][i];
            tmp1      += v[j+5][i] * v[j+5][i];
            tmp2      += v[j+6][i] * v[j+6][i];
            tmp3      += v[j+7][i] * v[j+7][i];
            acc[i] = tmp1 + tmp2 + tmp3;

C code I used for testing all the above loops, is here. To rule out memory bandwidth issues as much as it was possible, I run tests for a bunch of vectors small enough to fit into L1 cache. Throughputs for single core:

                SSE              AVX
naive:       1733.4 MFLOPS    1696.6 MFLOPS    // 1xCPU_CLOCK barrier for scalar instructions
horizontal:  5963.6 MFLOPS    9419.8 MFLOPS    // 4xCPU_CLOCK and 8xCPU_CLOCK for SSE and AVX
unrolled:   11264.8 MFLOPS   11496.6 MFLOPS
cached:     14253.7 MFLOPS   15086.5 MFLOPS
final:      17985.4 MFLOPS   18210.4 MFLOPS    // Both, SSE and AVX settle at around 10xCPU_CLOCK

So it seems, this midrange laptop CPU could potentially get us some 35 GFLOPS out of its two cores without resorting to nothing more than simple changes in pure C.

Things to consider:

  • Why for SSE, we did manage to get throughput of 10xCPU_CLOCK even though SSE operates on 4-floats chunks? SandyBridge architecture has separate execution units for addition and multiplication capable of performing operations in full parallel, this effectively means, SandyBridge can perform as if it had fused add-mul, upping, in some situations the theoretical limit to 8xCPU_CLOCK for SSE and 16xCPU_CLOCK for AVX.
  • Why then, we have 10xCPU_CLOCK not 8xCPU_CLOCK for SSE? SandyBridge CPUs have TurboBoost feature which provides extra headroom under certain circumstances, TurboBoost, however, may be very limited, especially when you properly harness all cores of your CPU.
  • Why then, we didn’t get more than 10xCPU_CLOCK and why we hit the same wall for AVX? We hit an L1 memory bandwidth bottleneck, further memory access optimizations are needed, and this is when carefully handcrafted assembly code may squeeze even more out of CPU.
  • How generic / reliable these optimization techniques are? Well, it’s a product of code, compiler and CPU, so your mileage may vary. But if you really wanna get real kick for your buck, you will have to research and experiment with your own devices.
  • Last but not least, keep in mind that such changes are likely to produce slightly different outputs due to floating-point rounding.




  1. Jason Riedy says:

    Meanwhile, the Fortran programmer writes y = dot_product (x, x) and moves on to the interesting bits. Plus, if auto-parallelization is on, or if this is in an OpenMP workshare section…

  2. jarek says:

    I am glad you know of libraries facilitating numerical tasks, but they were not my subject matter.

  3. Nathan Kurz says:

    Interesting summary of SIMD and pointer aliasing. I thought I’d compare the results of GCC/CLang/Intel, but the source link is 404. Is there a new one?

  4. Homer Wilson says:

    Thanks, an enjoyable note. This link is broken, however: https://www.lshift.net/blog/wp-content/uploads/2013/10/ve.c

  5. Rafael says:

    The notation used is incredibly confusing. Is the presented multiplication supposed to be part of a usual mathematical operation (matrix mult, dot product etc.)? As it stands it seems to be some sort of element-wise multiplication of two equal VECTOR_LEN * N matrices. Not the most common operation in typical signal processing and trivial to write in SIMD, especially using GCC or clang vector extensions.

  6. David Wragg says:

    Hi Jacek,

    It is possible to get efficient SIMD code out of gcc without transforming your source code.

    First, C99 added the ‘restrict’ keyword, which addresses the pointer aliasing issues of C relative to Fortran. The rules for restrict are complicated, but the basic idea is simple: A dot product function can be written as:

    float dotprod(float * restrict a, float * restrict b, int len)
    int i;
    float total = 0;

    for (i = 0; i < len; i++)
    total += *a++ * *b++;

    return total;

    (I hope that remains somewhat readable after passing though wordpress.)

    The restrict keywords there tells the compiler that the a and b pointers will not alias.

    But if you compile that code with gcc with a simple '-O2', you still get poor code. gcc does have quite a sophisticated vectorization infrastructure these days, but it you need to pass some options to give gcc permission to use it. In particular, for the code above I needed -O3 (to trigger vectorization and -ftree-vect-loop-version), -ffast-math (to allow gcc to reorder the additions) and -funroll-all-loops (to tell gcc to aggressively unroll the loop). I'm using gcc 4.7.2 as supplied with Fedora 18. My final compilation command is:

    gcc -O3 -ffast-math -funroll-all-loops -std=c99 -march=corei7 -c dotprod.c

    The link to your test code gives a 404 for me, so I couldn't compare with your results, but the inner loop looks reasonable:

    movups (%rsi,%r8), %xmm9
    addl $8, %r10d
    movups 16(%rsi,%r8), %xmm10
    mulps (%rdi,%r8), %xmm9
    mulps 16(%rdi,%r8), %xmm10
    movups 32(%rsi,%r8), %xmm11
    movups 48(%rsi,%r8), %xmm12
    addps %xmm9, %xmm1
    mulps 32(%rdi,%r8), %xmm11
    mulps 48(%rdi,%r8), %xmm12
    movups 64(%rsi,%r8), %xmm13
    addps %xmm10, %xmm1
    movups 80(%rsi,%r8), %xmm14
    mulps 64(%rdi,%r8), %xmm13
    mulps 80(%rdi,%r8), %xmm14
    addps %xmm11, %xmm1
    movups 96(%rsi,%r8), %xmm15
    movups 112(%rsi,%r8), %xmm2
    mulps 96(%rdi,%r8), %xmm15
    addps %xmm12, %xmm1
    mulps 112(%rdi,%r8), %xmm2
    subq $-128, %r8
    cmpl %ebx, %r10d
    addps %xmm13, %xmm1
    addps %xmm14, %xmm1
    addps %xmm15, %xmm1
    addps %xmm2, %xmm1
    jb .L8

  7. GravMu says:

    In the final loop, wouldn’t it be faster to use in the lines from “float tmp2 =” and onwards the simpler “v[j][i]*v[j++][i]” and removing the need for the “j += 8” in the first for loop, or does the need to constantly save to the L1 cache invalidate this?

  8. Eric says:

    Too bad the links are broken.

  9. Someone says:

    aren’t your unrolled loops missing { } around the instruction blocks?

  10. jarek says:

    Apologies for the broken link, we had maintenance tasks going on. I just brought it back with the original source. In the source you will find commands I used to compile and to run the code.

  11. quentusrex says:

    Can you update with some of the Fortran numbers tested on the same hardware? Even if it’s only useful to compare unoptimized results to each other.

    Also can you update with performance numbers using the cblas implementation?

    BTW, great work here. Very interesting demonstration.

  12. Victor Eijkhout says:

    Interesting demonstration.

    I was sort of expecting there to be a final example where the tmp1/2/3 was created before the loop and only summed together after the loop. In your code you create vectors of length 3 (because every tmp gets assigned 3 times, not because there are 3 tmps). If you move the tmp creation & final summation outside the loop you don’t flush the floating point pipe after 3 operations.

    Maybe it gives no improvement, but I’m curious. Maybe the biggest gain is to be had from the SIMD instructions (obviously, a factor of 4) rather than the pipelining, but I’d be curious to see for sure.

Leave a Reply

Your email address will not be published.

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>