SSE intrinsics optimizations in popular compilers

Lately I have been playing a lot with SSE optimizations, and I am really enjoying it so far – using functions to tell the compiler which instructions to use makes you feel the power at your fingertips. At first I was naive and thought the compiler would do exactly what it was told, assuming you know what you're doing – the SSE intrinsic header file is mostly a bunch of calls to internal GCC built-ins, or 'extern' declarations in MSVC, suggesting that the compiler will simply follow your lead.

I assumed wrong – the compiler will take the liberty of optimizing your code even further, in places you wouldn't even think about, though I have noticed this is not always the case with MSVC. MSVC sometimes trusts the coder too much, even when optimizations could obviously be made. After grasping the concept of SSE and what it can do, I quickly realized MSVC won't optimize as well as GCC 4.x or ICC will.

I read a lot of forum posts from people who want to gain speed by using SSE to optimize their core math operations, such as a 4D vector or a 4×4 matrix. While SSE can notably boost performance – by roughly 10-30% depending on usage – there is no magic switch that tells the compiler to use SSE for you. You need to know how to use the intrinsics, optimize as you go, and carefully examine the resulting assembly code.
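To give an idea of what such hand-written intrinsic code looks like, here is a minimal sketch of a 4D vector addition (the wrapper type and function name are hypothetical, not taken from any particular library):

#include <xmmintrin.h>

/* Hypothetical 4D vector wrapper – a sketch, not a library API. */
typedef struct { __m128 v; } vec4;

/* One packed addps replaces four scalar additions. */
static inline vec4 vec4_add(vec4 a, vec4 b) {
    vec4 r;
    r.v = _mm_add_ps(a.v, b.v);
    return r;
}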

This article will closely inspect and analyze the assembly output of three major compilers – GCC 4.x targeting Linux (4.3.3 specifically), the latest stable MSVC 2008 (version 9.0.30729.1 SP1) and ICC 11.1.

I'll start by listing the options I pass to each compiler – I am keeping them minimal and simple, yet sufficient to produce sane, optimized code.

GCC command line:

gcc -O2 -msse test.c -S -o test.asm

MSVC command line:

cl  /O2 /arch:SSE /c /FA test.c

ICC’s command line:

icc -O2 -msse test.c -S -o test.asm

MSVC automatically generates a file called test.asm, so there is no need to specify an output file. Apart from that, note the remarkable resemblance between the commands…

Basics

Let’s begin with a simple assignment test:

#include <xmmintrin.h>
 
extern void printv(__m128 m);
 
int main() {
    __m128 m = _mm_set_ps(4, 3, 2, 1);
    __m128 z = _mm_setzero_ps();
 
    printv(m);
    printv(z);
 
    return 0;
}

This will assign m to be [1, 2, 3, 4] and z to be [0, 0, 0, 0]. Note that the undefined 'extern' function printv is there to force the compilers not to optimize the variables away and to "prove" that they are used; since we only compile to assembly, there is no need to actually define printv.
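If you want to actually link and run these snippets rather than just read the assembly, a possible definition of printv could look like the following (this is my own sketch – the tests deliberately leave it undefined so the compiler cannot see through the call):

#include <stdio.h>
#include <xmmintrin.h>

/* Prints the four lanes of an SSE register, lowest lane first. */
void printv(__m128 m) {
    float f[4];
    _mm_storeu_ps(f, m);  /* unaligned store, so no alignment attributes needed */
    printf("%f %f %f %f\n", f[0], f[1], f[2], f[3]);
}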

The variable m is effectively const, even though we gave the compiler no hint. The compiler should recognize that m never changes and move it into a read-only data section. The zero vector should be created with the xorps opcode, which produces a zero vector quickly without spending any constant memory (x XOR x is always 0).

The output:

MSVC:
    movss   xmm2, DWORD PTR __real@40400000 ; 3.0f
    movss   xmm3, DWORD PTR __real@40800000 ; 4.0f
    movss   xmm0, DWORD PTR __real@3f800000 ; 1.0f
    movss   xmm1, DWORD PTR __real@40000000 ; 2.0f
    unpcklps xmm0, xmm2
    unpcklps xmm1, xmm3
    unpcklps xmm0, xmm1
    call    _printv
    xorps   xmm0, xmm0
    call    _printv
 
GCC:
    movaps  .LC0, %xmm0
    call    printv
    xorps   %xmm0, %xmm0
    call    printv
 
    .LC0:
        .long   1065353216 ; 1.0f
        .long   1073741824 ; 2.0f
        .long   1077936128 ; 3.0f
        .long   1082130432 ; 4.0f
 
ICC:
    movaps    _2il0floatpacket.0, %xmm0
    call      printv
    xorps     %xmm0, %xmm0
    call      printv
 
    _2il0floatpacket.0:
        .long   0x3f800000,0x40000000,0x40400000,0x40800000

Both GCC and ICC understood that the variable m is constant and moved it into a read-only data section. MSVC, however, chose to use four xmm registers to build m – it not only writes to valuable registers that a real application badly needs, it also forces stack spills if those registers already held data, which is common when inlining. It also makes poor use of the cache, since the vector is assembled piecemeal from four scalar constants instead of being loaded from a single 16-byte block that could be prefetched. All compilers, however, used xorps to create the zero vector, which is pleasing to see.
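If MSVC's element-by-element construction shows up in hot code, one workaround is to spell out the constant yourself as an aligned array and load it in one go – a sketch, assuming MSVC's __declspec(align(16)) extension (GCC and ICC use __attribute__((aligned(16))) instead):

#include <xmmintrin.h>

/* 16-byte aligned constant that the compiler places in read-only data. */
__declspec(align(16)) static const float k1234[4] = { 1.0f, 2.0f, 3.0f, 4.0f };

__m128 make_m(void) {
    return _mm_load_ps(k1234);  /* should compile to a single aligned 128-bit load */
}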

Arithmetic prediction

The next test is arithmetic prediction. It checks how the compiler deals with SSE operations whose operands are known, much as it folds constant integer operations. The compiler should precompute operations such as '1+1' and use the answer directly instead of making the CPU compute a result that is already known at compile time. The test is as follows:

#include <xmmintrin.h>
 
extern void printv(__m128 m);
 
int main() {
    __m128 m = _mm_set_ps(-4, -3, -2, -1);
    __m128 one = _mm_set1_ps(1.0f);
 
    printv(_mm_and_ps(m, _mm_setzero_ps())); // Always a zero vector
    printv(_mm_or_ps(m, _mm_set1_ps(-0.0f))); // Negate all (nop, all negative)
    printv(_mm_add_ps(m, _mm_setzero_ps())); // Add 0 (nop; x+0=x)
    printv(_mm_sub_ps(m, _mm_setzero_ps())); // Subtract 0 (nop; x-0=x)
    printv(_mm_mul_ps(m, one)); // Multiply by one (nop)
    printv(_mm_div_ps(m, one)); // Division by one (nop)
 
    return 0;
}

In the first test, the compiler should always pass a zeroed xmm register to printv, since x & 0 is always 0. The rest of the tests should always end up passing the same register, since each of them is just a roundabout way of writing a no-op.

The results:

MSVC:
    movss   xmm0, DWORD PTR __real@c0800000
    movss   xmm2, DWORD PTR __real@c0400000
    movss   xmm3, DWORD PTR __real@c0000000
    movss   xmm1, DWORD PTR __real@bf800000
    unpcklps xmm3, xmm0
    xorps   xmm0, xmm0
    unpcklps xmm1, xmm2
    unpcklps xmm1, xmm3
    movaps  XMMWORD PTR tv129[esp+32], xmm0
    movaps  XMMWORD PTR _m$[esp+32], xmm1
    andps   xmm0, xmm1
    call    _printv
    movss   xmm0, DWORD PTR __real@80000000
    shufps  xmm0, xmm0, 0
    orps    xmm0, XMMWORD PTR _m$[esp+32]
    call    _printv
    movaps  xmm0, XMMWORD PTR tv129[esp+32]
    addps   xmm0, XMMWORD PTR _m$[esp+32]
    call    _printv
    movaps  xmm0, XMMWORD PTR _m$[esp+32]
    subps   xmm0, XMMWORD PTR tv129[esp+32]
    call    _printv
 
GCC:
    xorps   %xmm0, %xmm0
    call    printv
    movaps  .LC0(%rip), %xmm0
    call    printv
    movaps  .LC0(%rip), %xmm0
    call    printv
    movaps  .LC0(%rip), %xmm0
    call    printv
 
    .LC0:
        .long   3212836864
        .long   3221225472
        .long   3225419776
        .long   3229614080
 
ICC:
        xorps     %xmm0, %xmm0
        call      printv
    movaps    _2il0floatpacket.2, %xmm0
    orps      _2il0floatpacket.0, %xmm0
    call      printv
    movaps    _2il0floatpacket.0, %xmm0
    call      printv
    movaps    _2il0floatpacket.0, %xmm0
    call      printv
    movaps    _2il0floatpacket.0, %xmm0
    mulps     _2il0floatpacket.1, %xmm0
    call      printv
    movaps    _2il0floatpacket.0, %xmm0
    divps     _2il0floatpacket.1, %xmm0
    call      printv
 
    _2il0floatpacket.0:
        .long   0xbf800000,0xc0000000,0xc0400000,0xc0800000
    _2il0floatpacket.1:
        .long   0x3f800000,0x3f800000,0x3f800000,0x3f800000
    _2il0floatpacket.2:
        .long   0x80000000,0x80000000,0x80000000,0x80000000

The results are certainly interesting. MSVC decided not to optimize the code and did exactly what it was told, producing redundant code. One more thing should be noted: the xorps that creates the zero vector is issued in the middle of the unpcklps sequence; it could have been moved after them to take advantage of instruction pairing (when the processor executes the same opcode back to back it is usually faster, especially in SSE-land, where the CPU operates on large 128-bit registers). GCC's code does exactly what we expect from a modern compiler: it statically evaluates every operation. ICC is selective about what it can determine – it left the redundant OR, multiplication and division in place while optimizing the others away.

Shuffles

The next test concerns shuffles. There are several sub-tests involving redundant shuffles that could easily be optimized out or merged, such as double reverses and back-to-back shuffles:

#include <xmmintrin.h>
 
extern void printv(__m128 m);
 
int main() {
    __m128 m = _mm_set_ps(4, 3, 2, 1);
    m = _mm_shuffle_ps(m, m, 0xE4); // NOP - shuffles to same order
    printv(m);
 
    m = _mm_shuffle_ps(m, m, 0x1B); // Reverses the vector
    m = _mm_shuffle_ps(m, m, 0x1B); // Reverses the vector again, NOP
    printv(m);
 
    m = _mm_shuffle_ps(m, m, 0x1B); // Reverses the vector
    m = _mm_shuffle_ps(m, m, 0x1B); // Reverses the vector again, NOP
    m = _mm_shuffle_ps(m, m, 0x1B); // All should be optimized to one shuffle
    printv(m);
 
    m = _mm_shuffle_ps(m, m, 0xC9); // Those two shuffles together swap pairs
    m = _mm_shuffle_ps(m, m, 0x2D); // And could be optimized to 0x4E
    printv(m);
 
    m = _mm_shuffle_ps(m, m, 0x55); // First element
    m = _mm_shuffle_ps(m, m, 0x55); // Redundant - since all are the same
    m = _mm_shuffle_ps(m, m, 0x55); // Let's stress it again
    m = _mm_shuffle_ps(m, m, 0x55); // And one last time
    printv(m);
 
    return 0;
}

The results here should contain a minimum of shuffles. The first two tests should produce no shuffle at all. The third should produce only one shuffle, reversing the vector (mask 0x1B). The fourth test should merge its two shuffles into one, turning masks 0xC9 and 0x2D into 0x4E (swap pairs). The last test should be reduced to a single shuffle, since every step ends up selecting the same value.
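As a quick reference for reading these masks: every two-bit field of the shufps immediate selects one source element (the lowest field fills destination element 0), and the _MM_SHUFFLE macro from xmmintrin.h builds such an immediate. The enum below (the names are mine, purely for illustration) spells out the constants used in this test:

#include <xmmintrin.h>

/* _MM_SHUFFLE packs the four two-bit selectors from high to low. */
enum {
    MASK_IDENTITY  = _MM_SHUFFLE(3, 2, 1, 0), /* 0xE4 - order unchanged          */
    MASK_REVERSE   = _MM_SHUFFLE(0, 1, 2, 3), /* 0x1B - reverses all elements    */
    MASK_SWAPPAIRS = _MM_SHUFFLE(1, 0, 3, 2), /* 0x4E - swaps the two halves     */
    MASK_ROTATE_A  = _MM_SHUFFLE(3, 0, 2, 1), /* 0xC9 - first of the merged pair */
    MASK_ROTATE_B  = _MM_SHUFFLE(0, 2, 3, 1), /* 0x2D - composed, equals 0x4E    */
    MASK_SPLAT     = _MM_SHUFFLE(1, 1, 1, 1)  /* 0x55 - broadcasts one element   */
};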

MSVC:
    movss   xmm1, DWORD PTR __real@40800000 ; 4.0f
    movss   xmm2, DWORD PTR __real@40400000 ; 3.0f
    movss   xmm3, DWORD PTR __real@40000000 ; 2.0f
    movss   xmm0, DWORD PTR __real@3f800000 ; 1.0f
    unpcklps xmm3, xmm1
    unpcklps xmm0, xmm2
    unpcklps xmm0, xmm3
    shufps  xmm0, xmm0, 228         ; 000000e4H
    movaps  XMMWORD PTR _m$[esp+16], xmm0
    call    _printv
    movaps  xmm0, XMMWORD PTR _m$[esp+16]
    shufps  xmm0, xmm0, 27          ; 0000001bH
    shufps  xmm0, xmm0, 27          ; 0000001bH
    movaps  XMMWORD PTR _m$[esp+16], xmm0
    call    _printv
    movaps  xmm0, XMMWORD PTR _m$[esp+16]
    shufps  xmm0, xmm0, 27          ; 0000001bH
    shufps  xmm0, xmm0, 27          ; 0000001bH
    shufps  xmm0, xmm0, 27          ; 0000001bH
    movaps  XMMWORD PTR _m$[esp+16], xmm0
    call    _printv
    movaps  xmm0, XMMWORD PTR _m$[esp+16]
    shufps  xmm0, xmm0, 201         ; 000000c9H
    shufps  xmm0, xmm0, 45          ; 0000002dH
    movaps  XMMWORD PTR _m$[esp+16], xmm0
    call    _printv
    movaps  xmm0, XMMWORD PTR _m$[esp+16]
    shufps  xmm0, xmm0, 85          ; 00000055H
    shufps  xmm0, xmm0, 85          ; 00000055H
    shufps  xmm0, xmm0, 85          ; 00000055H
    shufps  xmm0, xmm0, 85          ; 00000055H
    call    _printv
 
GCC:
    movaps  .LC0, %xmm1
    movaps  %xmm1, %xmm0
    movaps  %xmm1, -24(%ebp)
    call    printv
    movaps  -24(%ebp), %xmm1
    movaps  %xmm1, %xmm0
    call    printv
    movaps  .LC1, %xmm0
    call    printv
    movaps  .LC1, %xmm1
    shufps  $201, %xmm1, %xmm1
    shufps  $45, %xmm1, %xmm1
    movaps  %xmm1, %xmm0
    movaps  %xmm1, -24(%ebp)
    call    printv
    movaps  -24(%ebp), %xmm1
    movaps  %xmm1, %xmm0
    shufps  $85, %xmm1, %xmm0
    call    printv
 
    .LC0:
        .long   1065353216 ; 1.0f
        .long   1073741824 ; 2.0f
        .long   1077936128 ; 3.0f
        .long   1082130432 ; 4.0f
 
    .LC1:
        .long   1082130432 ; 4.0f
        .long   1077936128 ; 3.0f
        .long   1073741824 ; 2.0f
        .long   1065353216 ; 1.0f
 
ICC:
    movaps    _2il0floatpacket.0, %xmm0
    addl      $4, %esp
    shufps    $228, %xmm0, %xmm0
    movaps    %xmm0, (%esp)
    call      printv
    movaps    (%esp), %xmm0
    shufps    $27, %xmm0, %xmm0
    shufps    $27, %xmm0, %xmm0
    movaps    %xmm0, (%esp)
    call      printv
    movaps    (%esp), %xmm0
    shufps    $27, %xmm0, %xmm0
    shufps    $27, %xmm0, %xmm0
    shufps    $27, %xmm0, %xmm0
    movaps    %xmm0, (%esp)
    call      printv
    movaps    (%esp), %xmm0
    shufps    $201, %xmm0, %xmm0
    shufps    $45, %xmm0, %xmm0
    movaps    %xmm0, (%esp)
    call      printv
    movaps    (%esp), %xmm0
    shufps    $85, %xmm0, %xmm0
    shufps    $85, %xmm0, %xmm0
    shufps    $85, %xmm0, %xmm0
    shufps    $85, %xmm0, %xmm0
    call      printv
 
    _2il0floatpacket.0:
        .long   0x3f800000,0x40000000,0x40400000,0x40800000

The results are interesting and quite surprising – GCC passed all of the tests except the shuffle merge, while MSVC and ICC didn't optimize any of the shuffles. Shame. It is also interesting that GCC chose to reload the reversed vector from memory for the later operations instead of caching it in a spare xmm register (copying registers is faster than copying memory, even aligned memory).

Dynamic input

All of the previous tests were about the compiler making decisions on static data. Now it's time for functions, where the input and output aren't known. A notable example of a vector operation is normalization. Here is a function that normalizes an SSE vector and returns a normalized copy.

__m128 normalize(__m128 m) {
    __m128 l = _mm_mul_ps(m, m);
    l = _mm_add_ps(l, _mm_shuffle_ps(l, l, 0x4E));
    return _mm_div_ps(m, _mm_sqrt_ps(_mm_add_ps(l,
                                       _mm_shuffle_ps(l, l, 0x11))));
}

The function is already quite optimized: it hints to the compiler what should be a temporary and what should be reused, and it takes a total of 7 operations.
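To see what the shuffles accomplish: the first shuffle/add sums each squared component with the one two lanes away, and the second finishes the horizontal sum, so every lane of l ends up holding the squared length. A plain scalar equivalent (a sketch for reference only, not part of the test) would be:

#include <math.h>

/* Scalar equivalent of the SSE normalize() above: every output lane is the
   corresponding input lane divided by the length of the whole 4D vector. */
void normalize_ref(const float m[4], float out[4]) {
    float len = sqrtf(m[0] * m[0] + m[1] * m[1] +
                      m[2] * m[2] + m[3] * m[3]);
    int i;
    for (i = 0; i < 4; ++i)
        out[i] = m[i] / len;
}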

The result we expect is a direct projection of the SSE intrinsics into assembly, using only three registers (the original vector, its square and the length).

It is good to see that all of the compilers are equal here – these are exactly the results we expected.

Inline Functions

The next test combines function calls with static compile-time data to exercise inline functions. An inline function should have its body embedded into the calling routine and be optimized for each specific call site where possible. A classic candidate for inlining is 'abs':

#include <xmmintrin.h>
 
extern void printv(__m128 m);
 
/* This is called _mm_abs_ps because 'abs' is a built-in function
   and C does not allow overloading */
inline __m128 _mm_abs_ps(__m128 m) {
    return _mm_andnot_ps(_mm_set1_ps(-0.0f), m);
}
 
int main() {
        // All positive
    printv(_mm_abs_ps(_mm_set_ps(1.0f, 0.0f, 0.0f, 1.0f)));
        // All negative
    printv(_mm_abs_ps(_mm_set_ps(-1.0f, -0.0f, -0.0f, -1.0f)));
        // Mixed
    printv(_mm_abs_ps(_mm_set_ps(-1.0f, -0.0f, 0.0f, 1.0f)));
}

The result we expect is perfect inlining of the function, producing the same vector for all three calls. A good compiler will also avoid duplicating the constant across the three calls and reuse the same vector throughout the program, since the linker will most likely not merge identical constants for you.

MSVC:
    main:
        movss   xmm1, DWORD PTR __real@3f800000
        xorps   xmm2, xmm2
        movss   xmm0, DWORD PTR __real@80000000
        movaps  xmm3, xmm1
        movaps  xmm4, xmm2
        shufps  xmm0, xmm0, 0
        movaps  XMMWORD PTR tv166[esp+16], xmm0
        unpcklps xmm2, xmm3
        unpcklps xmm1, xmm4
        unpcklps xmm1, xmm2
        andnps  xmm0, xmm1
        call    printv
        movss   xmm2, DWORD PTR __real@bf800000
        movss   xmm1, DWORD PTR __real@80000000
        movaps  xmm0, XMMWORD PTR tv166[esp+16]
        movaps  xmm3, xmm2
        movaps  xmm4, xmm1
        unpcklps xmm1, xmm3
        unpcklps xmm2, xmm4
        unpcklps xmm2, xmm1
        andnps  xmm0, xmm2
        call    printv
        movss   xmm1, DWORD PTR __real@bf800000
        movss   xmm2, DWORD PTR __real@80000000
        xorps   xmm3, xmm3
        movss   xmm4, DWORD PTR __real@3f800000
        movaps  xmm0, XMMWORD PTR tv166[esp+16]
        unpcklps xmm3, xmm1
        unpcklps xmm4, xmm2
        unpcklps xmm4, xmm3
        andnps  xmm0, xmm4
        call    printv
 
GCC:
    main:
        movaps  .LC1, %xmm0
        call    printv
        movaps  .LC1, %xmm0
        call    printv
        movaps  .LC1, %xmm0
        call    printv
 
    .LC0:
        .long   2147483648 ; -0.0f
        .long   2147483648 ; -0.0f
        .long   2147483648 ; -0.0f
        .long   2147483648 ; -0.0f
 
    .LC1:
        .long   1065353216 ; 1.0f
        .long   0          ; 0.0f
        .long   0          ; 0,0f
        .long   1065353216 ; 1.0f
 
ICC:
    movaps    _2il0floatpacket.7, %xmm0
    addl      $4, %esp
    movaps    %xmm0, (%esp)
    andnps    _2il0floatpacket.6, %xmm0
    call      printv
    movaps    (%esp), %xmm0
    andnps    _2il0floatpacket.8, %xmm0
    call      printv
    movaps    (%esp), %xmm0
    andnps    _2il0floatpacket.9, %xmm0
    call      printv
 
    _2il0floatpacket.6:
        .long   0x3f800000,0x00000000,0x00000000,0x3f800000
    _2il0floatpacket.7:
        .long   0x80000000,0x80000000,0x80000000,0x80000000
    _2il0floatpacket.8:
        .long   0xbf800000,0x80000000,0x80000000,0xbf800000
    _2il0floatpacket.9:
        .long   0x3f800000,0x00000000,0x80000000,0xbf800000
    _2il0floatpacket.11:
        .long   0x80000000,0x80000000,0x80000000,0x80000000

This time each compiler chose its own way of optimizing. MSVC's horrible assignment code, combined with its inability to evaluate static operations, resulted in redundant code. ICC inlined the function but kept some of the static data on the stack (even though it is readily available in aligned read-only memory) and did not perform any precomputation. GCC optimized the code exactly as we expected, but it "forgot" to remove the unnecessary helper vector (.LC0), which is never referenced. This isn't a big deal, though, because the linker will simply drop unreferenced constant objects. GCC most likely kept it for the cases where the inlined function would actually have needed it.

SSE comparison prediction

A good compiler should also predict comparisons and eliminate unused code when a check is always true or always false. SSE provides a way to compare four floats at once using the cmp*ps instructions: where the comparison holds, the instruction sets the component to a mask of all 1s; where it fails, to all 0s. Such an instruction can easily be eliminated when the result is known at compile time – especially inside inline functions. The test implements the function 'sign', which returns 1, 0 or -1 per component.

#include <xmmintrin.h>
 
extern void printv(__m128 m);
 
inline __m128 sign(__m128 m) {
    return _mm_and_ps(_mm_or_ps(_mm_and_ps(m, _mm_set1_ps(-0.0f)),
                _mm_set1_ps(1.0f)),
              _mm_cmpneq_ps(m, _mm_setzero_ps()));
}
 
int main()
{
    __m128 m = _mm_setr_ps(1, -2, 3, -4);
 
    printv(_mm_cmpeq_ps(m, m)); // Equal to itself
    printv(_mm_cmpgt_ps(m, _mm_setzero_ps())); // Greater than zero
    printv(_mm_cmplt_ps(m, _mm_setzero_ps())); // Less than zero
 
    printv(sign(_mm_setr_ps( 1,  2,  3,  4))); // All 1's
    printv(sign(_mm_setr_ps(-1, -2, -3, -4))); // All -1's
    printv(sign(_mm_setr_ps( 0,  0,  0,  0))); // All 0's
    printv(sign(m)); // Mixed
}

A good compiler will eliminate these checks and create a constant copy of the result, especially where every comparison fails and the result is simply a zero vector. In the following test, we will inspect the code generated for several comparisons whose results are constant, which can be folded without sacrificing application size.
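For reference, here is the per-component behaviour the compiler would have to fold – a scalar restatement of sign(), written here for illustration rather than taken from the test code:

/* Scalar restatement of the SSE sign() above: the AND/OR pair copies the
   sign bit of x onto 1.0f, and the cmpneq mask zeroes the result whenever x
   compares equal to zero (including -0.0f). */
float sign_ref(float x) {
    float s = (x < 0.0f) ? -1.0f : 1.0f;
    return (x != 0.0f) ? s : 0.0f;
}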

MSVC:
    movss   xmm2, DWORD PTR __real@40400000
    movss   xmm3, DWORD PTR __real@c0800000
    movss   xmm0, DWORD PTR __real@3f800000
    movss   xmm1, DWORD PTR __real@c0000000
    unpcklps xmm0, xmm2
    unpcklps xmm1, xmm3
    unpcklps xmm0, xmm1
    movaps  XMMWORD PTR _m$[esp+64], xmm0
    cmpeqps xmm0, xmm0
    call    _printv
    xorps   xmm0, xmm0
    movaps  XMMWORD PTR tv258[esp+64], xmm0
    cmpltps xmm0, XMMWORD PTR _m$[esp+64]
    call    _printv
    movaps  xmm0, XMMWORD PTR _m$[esp+64]
    cmpltps xmm0, XMMWORD PTR tv258[esp+64]
    call    _printv
    movss   xmm2, DWORD PTR __real@3f800000
    movss   xmm3, DWORD PTR __real@40400000
    movss   xmm4, DWORD PTR __real@40800000
    movss   xmm0, DWORD PTR __real@40000000
    movaps  xmm1, xmm2
    shufps  xmm2, xmm2, 0
    movaps  XMMWORD PTR tv271[esp+64], xmm2
    unpcklps xmm0, xmm4
    unpcklps xmm1, xmm3
    unpcklps xmm1, xmm0
    movss   xmm0, DWORD PTR __real@80000000
    shufps  xmm0, xmm0, 0
    movaps  XMMWORD PTR tv272[esp+64], xmm0
    andps   xmm0, xmm1
    orps    xmm0, xmm2
    movaps  xmm2, XMMWORD PTR tv258[esp+64]
    cmpneqps xmm2, xmm1
    andps   xmm0, xmm2
    call    _printv
    movss   xmm2, DWORD PTR __real@c0400000
    movss   xmm3, DWORD PTR __real@c0800000
    movss   xmm1, DWORD PTR __real@bf800000
    movss   xmm0, DWORD PTR __real@c0000000
    unpcklps xmm1, xmm2
    movaps  xmm2, XMMWORD PTR tv258[esp+64]
    unpcklps xmm0, xmm3
    unpcklps xmm1, xmm0
    movaps  xmm0, XMMWORD PTR tv272[esp+64]
    andps   xmm0, xmm1
    orps    xmm0, XMMWORD PTR tv271[esp+64]
    cmpneqps xmm2, xmm1
    andps   xmm0, xmm2
    call    _printv
    xorps   xmm1, xmm1
    movaps  xmm0, xmm1
    movaps  xmm2, xmm1
    movaps  xmm3, xmm1
    unpcklps xmm1, xmm2
    movaps  xmm2, XMMWORD PTR tv258[esp+64]
    unpcklps xmm0, xmm3
    unpcklps xmm1, xmm0
    movaps  xmm0, XMMWORD PTR tv272[esp+64]
    andps   xmm0, xmm1
    orps    xmm0, XMMWORD PTR tv271[esp+64]
    cmpneqps xmm2, xmm1
    andps   xmm0, xmm2
    call    _printv
    movaps  xmm2, XMMWORD PTR _m$[esp+64]
    movaps  xmm0, XMMWORD PTR tv272[esp+64]
    movaps  xmm1, XMMWORD PTR tv258[esp+64]
    andps   xmm0, xmm2
    orps    xmm0, XMMWORD PTR tv271[esp+64]
    cmpneqps xmm1, xmm2
    andps   xmm0, xmm1
    call    _printv
 
GCC:
    movaps  .LC2(%rip), %xmm0
    movaps  %xmm0, (%rsp)
    cmpeqps %xmm0, %xmm0
    call    printv
    xorps   %xmm0, %xmm0
    cmpltps (%rsp), %xmm0
    call    printv
    xorps   %xmm1, %xmm1
    movaps  (%rsp), %xmm0
    cmpltps %xmm1, %xmm0
    call    printv
    xorps   %xmm0, %xmm0
    cmpneqps    .LC3(%rip), %xmm0
    andps   .LC1(%rip), %xmm0
    call    printv
    movaps  .LC0(%rip), %xmm0
    xorps   %xmm1, %xmm1
    orps    .LC1(%rip), %xmm0
    cmpneqps    .LC4(%rip), %xmm1
    andps   %xmm1, %xmm0
    call    printv
    xorps   %xmm0, %xmm0
    cmpneqps    %xmm0, %xmm0
    andps   .LC1(%rip), %xmm0
    call    printv
    xorps   %xmm1, %xmm1
    movaps  (%rsp), %xmm0
    cmpneqps    %xmm1, %xmm0
    movaps  (%rsp), %xmm1
    andps   .LC0(%rip), %xmm1
    orps    .LC1(%rip), %xmm1
    andps   %xmm1, %xmm0
    call    printv
 
    .LC0:
        .long   2147483648
        .long   2147483648
        .long   2147483648
        .long   2147483648
    .LC1:
        .long   1065353216
        .long   1065353216
        .long   1065353216
        .long   1065353216
    .LC2:
        .long   1065353216
        .long   3221225472
        .long   1077936128
        .long   3229614080
    .LC3:
        .long   1065353216
        .long   1073741824
        .long   1077936128
        .long   1082130432
    .LC4:
        .long   3212836864
        .long   3221225472
        .long   3225419776
        .long   3229614080
 
ICC:
    movaps    _2il0floatpacket.8, %xmm0
    addl      $4, %esp
    cmpeqps   %xmm0, %xmm0
    call      printv
    xorps     %xmm0, %xmm0
    cmpltps   _2il0floatpacket.8, %xmm0
    call      printv
    movaps    _2il0floatpacket.8, %xmm0
    xorps     %xmm1, %xmm1
    cmpltps   %xmm1, %xmm0
    call      printv
    movaps    _2il0floatpacket.9, %xmm0
    movaps    _2il0floatpacket.9, %xmm2
    andps     _2il0floatpacket.10, %xmm0
    xorps     %xmm1, %xmm1
    cmpneqps  %xmm1, %xmm2
    orps      _2il0floatpacket.11, %xmm0
    andps     %xmm2, %xmm0
    call      printv
    movaps    _2il0floatpacket.12, %xmm1
    movaps    _2il0floatpacket.10, %xmm0
    andps     %xmm1, %xmm0
    orps      _2il0floatpacket.11, %xmm0
    xorps     %xmm2, %xmm2
    cmpneqps  %xmm2, %xmm1
    andps     %xmm1, %xmm0
    call      printv
    xorps     %xmm0, %xmm0
    cmpneqps  %xmm0, %xmm0
    call      printv
    movaps    _2il0floatpacket.10, %xmm0
    movaps    _2il0floatpacket.8, %xmm1
    andps     %xmm1, %xmm0
    orps      _2il0floatpacket.11, %xmm0
    xorps     %xmm2, %xmm2
    cmpneqps  %xmm2, %xmm1
    andps     %xmm1, %xmm0
    call      printv
 
    _2il0floatpacket.8:
        .long   0x3f800000,0xc0000000,0x40400000,0xc0800000
    _2il0floatpacket.9:
        .long   0x3f800000,0x40000000,0x40400000,0x40800000
    _2il0floatpacket.10:
        .long   0x80000000,0x80000000,0x80000000,0x80000000
    _2il0floatpacket.11:
        .long   0x3f800000,0x3f800000,0x3f800000,0x3f800000
    _2il0floatpacket.12:
        .long   0xbf800000,0xc0000000,0xc0400000,0xc0800000
    _2il0floatpacket.14:
        .long   0x80000000,0x80000000,0x80000000,0x80000000
    _2il0floatpacket.15:
        .long   0x3f800000,0x3f800000,0x3f800000,0x3f800000

None of the compilers optimized the comparisons, which could benefit the code to a large extent, especially when inlined. It is worth noting that GCC merged some of the constants, eliminating two of the vectors that ICC kept. ICC and GCC both optimized away useless ORs where possible, while MSVC simply followed the code intrinsic by intrinsic.

Conclusion

I keep hearing the catch-phrase among programmers that "the compiler is better than you [think]." I completely disagree with it and object to its use. Not only does it lead novice programmers to give the compiler credit in cases where it is impossible to expect a compiler to optimize, it also makes more advanced programmers lazy, trusting that the compiler knows what it is doing.

Demonstrated here is a case of using the so-called 'intrinsics' to guide the compiler rather than instruct it. As the examples above show, only GCC (and, to an extent, ICC) behaves the way we expect, though it still misses a few cases (such as merging shuffles and predicting vector comparisons). MSVC is probably the worst example of an SSE-guided compiler – not only did it fail to optimize any of the tests, it generated horrible assignment code that abused the stack most of the time and hurt performance by not utilizing the cache properly.

If you are going to write code using SSE intrinsics, I advise you to take a close look at the generated code if you want maximum performance. Taking advantage of SSE for speed is very satisfying when done properly – but instruction pairing, redundant arithmetic operations and redundant compares usually have to be optimized by human beings; you should not rely on the compiler to do it for you. Compilers are given far more credit than they deserve.

As a side note on GCC's near-perfect code generation – I was quite surprised to see it surpass even Intel's own compiler! It shows that even compiler writers, who know their own hardware and its internal mechanisms, can overlook simple problems in the way humans think – redundancy, in most cases. I highly recommend giving the newest GCC 4.4 a try. If you are on Linux you most likely have GCC 4.3.x, and if your distribution is an early adopter (Gentoo, Fedora…) you might already have 4.4. Windows users are lucky enough that GCC 4.4 has been ported successfully to Windows in both the MinGW suite and the TDM suite. Mac users might have to compile GCC 4.4 themselves, since Xcode ships with GCC 4.0.1.

Happy optimizing!

12 thoughts on "SSE intrinsics optimizations in popular compilers"

  1. Ben August 8, 2009 / 13:32

    Thanks! I’m just starting to look into using these instructions and this was a great read.

  2. non August 28, 2009 / 06:12

    Actually GCC SSE Intrinsics completely sucks when it comes to register utilisation. For a very good example, try to compare GCC and ICC output of John the ripper’s SSE implementation of MD5.

    GCC is incapable of handling it correctly, constantly moving data from/to the stack. The performance difference is up to 4 times faster for ICC !

  3. LiraNuna August 29, 2009 / 13:26

    non: What GCC version are you talking about? I agree GCC 3.4.x (MinGW's version) is truly horrible when it comes to register allocation, but that was revised twice – in GCC 4.0 (SSA trees) and in 4.4 with the new register allocator (called IRA), which produces code that IMO looks like hand-coded assembly.

  4. Xo Wang December 15, 2009 / 19:31

    I have to agree that GCC (I’m using 4.4.1-tdm-2 and WPG 4.5.0) does a wonderful job turning intrinsic code into assembly. I converted an inline asm 4×4 matrix multiply routine into intrinsics and noticed that the output was nearly identical to my handcoded original, including instruction pairing/interleaving, with the exception of using different xmm registers.

    In fact, it actually became more efficient because GCC inline asm required explicit load/unload into registers (loop counters, addresses, etc.) to be passed into the inline asm block, while the intrinsics-generated code used the registers from the preceding code.

    Finally, if you have labels in inline assembly, the block can’t be inside an inline function, since the label might end up appearing twice in the same asm file. My previous solution was to jump by a manually-calculated offset to where the label would be—very tedious.

    Basically speaking, GCC 4.4 has made it easier and more efficient to code simple to understand and somewhat portable vector routines, than to write them in straight C/C++ and pray that the vectorizer picks up the loop (which is something they should work on now).

  5. Xo Wang December 15, 2009 / 19:41

    Oh, one gripe I do have is with the math code generated by GCC in -mfpmath=sse mode. It does tons of cvtss2sd and cvtsd2ss and xmm stack x87 moves even when I’m only using single-precision floats. With all the great compile-time evaluation and register allocation it has, I can’t believe there is so much inefficiency in plain math code.

  6. LiraNuna December 15, 2009 / 21:20

    If you're getting a lot of cvtss2sd, that means you're using the double-typed math functions, such as sin instead of sinf, and GCC does what you asked for – sin takes a double and returns a double, so GCC can't avoid that conversion (except when using -ffast-math).

    If your target is x86, try using -mfpmath=sse,387 for single-scalar operations.

    On an additional note, llvm-gcc seems to pass all tests except comparison prediction.

    Extra note, GCC 4.5 will have plugin support. I plan to write a plugin that will enhance the SSE generation code (mainly swizzle merging and branch prediction with SSE vectors).

  7. Michael March 17, 2010 / 04:28

    “I keep hearing the catch-phrase among programmers that “the compiler is better than you [think].”

    It’s such a silly statement. Perhaps it is true for those who say it …

    That you have to even use intrinsics in the first place is a pretty good indicator compilers still have a long long way to go.

    Interesting article, and good to see gcc is kicking bottom here. I don’t think it’s that they know the cpu more than intel does, they can just ‘afford’ more resources, and can share implementation with other cpu’s like power or cell’s spu.

    I’m kind of surprised sse doesn’t speed things up more though, is it just that the sisd code runs is so fast or that the simd unit isn’t that fast? (compared to say cell/spu).

  8. corysama October 24, 2010 / 12:15

    Just for fun, I repeated your experiment in MSVC 2010 RTM. The compiler has significantly improved, but it still has not caught up to gcc. Here’s a summary:

    Basics: match gcc
    ArithmeticPrediction: match gcc for mul&div, others eliminated unpcklps ops
    Shuffles: no change, still bad
    Dynamic Input: still good
    InlineFunctions: almost matches ICC, but 1 extra movaps per printv
    ComparisonPrediction: 1st 3 match gcc, 2nd 3 eliminated the unpcklps ops

  9. B April 23, 2012 / 20:25

    Would be interesting to see how Clang does in the above test

  10. Sundaram November 9, 2012 / 07:25

    How did you measure the performance? I’d like to know in Linux what you did to measure it, on Windows people generally use QueryPerformanceCounter.

  11. Gabriele Giuseppini September 22, 2018 / 02:51

    Awesome writeup, thank you so much. I believe that nowadays MSVC 2017 does a far better job – I’m using libsimdpp and I see very well optimized code. If I find some time I’ll run your same tests against it and share the results.
