Comments on: SSE intrinsics optimizations in popular compilers

By: C++ Team Blog | Game performance and compilation time improvements in Visual Studio 2019 - Microsoft Today

Tue, 19 Mar 2019 09:20:22 +0000

[…] examples can be found in this older blog post, which discusses the SIMD generation of several compilers – VS 2019 now handles all the cases as […]

By: Gabriele Giuseppini

Gabriele Giuseppini — Sat, 22 Sep 2018 09:51:47 +0000

Awesome writeup, thank you so much. I believe that nowadays MSVC 2017 does a far better job – I’m using libsimdpp and I see very well optimized code. If I find some time I’ll run your same tests against it and share the results.

By: Sundaram

Sundaram — Fri, 09 Nov 2012 14:25:59 +0000

How did you measure the performance? I’d like to know what you did to measure it on Linux; on Windows, people generally use QueryPerformanceCounter.
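The post does not say how its timings were taken. One common approach on Linux, sketched here as an assumption and not as the author's method, is `clock_gettime` with `CLOCK_MONOTONIC`, which plays roughly the role QueryPerformanceCounter plays on Windows. The helper names `now_ns` and `time_ns_per_call` are hypothetical.

```cpp
#include <cstdint>
#include <time.h>   // clock_gettime, CLOCK_MONOTONIC (POSIX)

// Monotonic timestamp in nanoseconds. CLOCK_MONOTONIC is not affected
// by wall-clock adjustments, which makes it suitable for interval timing.
std::uint64_t now_ns()
{
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return static_cast<std::uint64_t>(ts.tv_sec) * 1000000000ull
         + static_cast<std::uint64_t>(ts.tv_nsec);
}

// Hypothetical helper: run fn `iterations` times and return the
// average time per call in nanoseconds.
double time_ns_per_call(void (*fn)(), unsigned iterations)
{
    const std::uint64_t start = now_ns();
    for (unsigned i = 0; i < iterations; ++i)
        fn();
    const std::uint64_t end = now_ns();
    return static_cast<double>(end - start) / static_cast<double>(iterations);
}
```

Averaging over many iterations matters here because a single short SSE routine finishes well below the resolution at which one call can be timed reliably.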

By: B

Tue, 24 Apr 2012 03:25:48 +0000

It would be interesting to see how Clang does in the above test.

By: corysama

corysama — Sun, 24 Oct 2010 19:15:44 +0000

Just for fun, I repeated your experiment in MSVC 2010 RTM. The compiler has significantly improved, but it still has not caught up to gcc. Here’s a summary:

Basics: match gcc
ArithmeticPrediction: match gcc for mul&div, others eliminated unpcklps ops
Shuffles: no change, still bad
Dynamic Input: still good
InlineFunctions: almost matches ICC, but 1 extra movaps per printv
ComparisonPrediction: 1st 3 match gcc, 2nd 3 eliminated the unpcklps ops

By: Michael

Michael — Wed, 17 Mar 2010 11:28:04 +0000

“I keep hearing the catch-phrase among programmers that ‘the compiler is better than you [think].’”

It’s such a silly statement. Perhaps it is true for those who say it …

That you have to even use intrinsics in the first place is a pretty good indicator compilers still have a long long way to go.

Interesting article, and good to see gcc is kicking bottom here. I don’t think it’s that they know the CPU better than Intel does; they can just ‘afford’ more resources, and can share implementation with other CPUs like POWER or Cell’s SPU.

I’m kind of surprised SSE doesn’t speed things up more, though. Is it just that the SISD code runs so fast, or that the SIMD unit isn’t that fast (compared to, say, Cell/SPU)?

By: LiraNuna

LiraNuna — Wed, 16 Dec 2009 04:20:20 +0000

If you’re getting a lot of cvtss2sd, that means you’re using the double-typed math functions, such as sin instead of sinf, and GCC is doing what you asked of it: sin takes a double and returns a double, so GCC can’t avoid that conversion (except when using -ffast-math).
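A minimal sketch of the difference, written in C++ with the promotion spelled out. In C, calling `sin(x)` with a `float` argument performs the same conversions implicitly. The function names `wave_double` and `wave_single` are made up for illustration, and the instruction names in the comments are what one would typically expect on x86 with SSE math, not a guarantee for every compiler version.

```cpp
#include <cmath>

// Double-precision path: the float argument is widened to double
// (typically cvtss2sd), sin is evaluated in double, and the result
// is narrowed back to float (typically cvtsd2ss).
float wave_double(float x)
{
    return static_cast<float>(std::sin(static_cast<double>(x)));
}

// Single-precision path: the float overload of std::sin (spelled
// sinf in C) takes and returns float, so no conversion is needed.
float wave_single(float x)
{
    return std::sin(x);
}
```

Comparing the assembly of the two functions (for example with `g++ -O2 -S`) is a quick way to check whether stray conversions come from the source and not from the compiler.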

If your target is x86, try using -mfpmath=sse,387 for single-scalar operations.

On an additional note, llvm-gcc seems to pass all tests except comparison prediction.

Extra note, GCC 4.5 will have plugin support. I plan to write a plugin that will enhance the SSE generation code (mainly swizzle merging and branch prediction with SSE vectors).

By: Xo Wang

Xo Wang — Wed, 16 Dec 2009 02:41:17 +0000

Oh, one gripe I do have is with the math code generated by GCC in -mfpmath=sse mode. It does tons of cvtss2sd and cvtsd2ss and xmm stack x87 moves even when I'm only using single-precision floats. With all the great compile-time evaluation and register allocation it has, I can't believe there is so much inefficiency in plain math code.

By: Xo Wang

Xo Wang — Wed, 16 Dec 2009 02:31:39 +0000

I have to agree that GCC (I'm using 4.4.1-tdm-2 and WPG 4.5.0) does a wonderful job turning intrinsic code into assembly. I converted an inline asm 4x4 matrix multiply routine into intrinsics and noticed that the output was nearly identical to my handcoded original, including instruction pairing/interleaving, with the exception of using different xmm registers.

In fact, it actually became more efficient, because GCC inline asm required values (loop counters, addresses, etc.) to be explicitly loaded into registers and passed into the inline asm block, while the intrinsics-generated code used the registers from the preceding code.

Finally, if you have labels in inline assembly, the block can't be inside an inline function, since the label might end up appearing twice in the same asm file. My previous solution was to jump by a manually-calculated offset to where the label would be, which is very tedious.

Basically speaking, GCC 4.4 has made it easier and more efficient to code simple-to-understand and somewhat portable vector routines with intrinsics than to write them in straight C/C++ and pray that the vectorizer picks up the loop (which is something they should work on now).
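For readers who have not seen such a routine, here is a minimal sketch of a 4x4 matrix multiply written with SSE intrinsics. It is not the commenter's code: the function name `mat4_mul`, the row-major layout, and the use of unaligned loads are assumptions made for illustration.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 targets only)

// out = a * b for row-major 4x4 float matrices (16 floats each).
// Unaligned loads/stores are used so the sketch places no alignment
// requirement on the caller; aligned variants exist as _mm_load_ps
// and _mm_store_ps.
void mat4_mul(const float* a, const float* b, float* out)
{
    // Load the four rows of b once; they are reused for every output row.
    const __m128 b0 = _mm_loadu_ps(b + 0);
    const __m128 b1 = _mm_loadu_ps(b + 4);
    const __m128 b2 = _mm_loadu_ps(b + 8);
    const __m128 b3 = _mm_loadu_ps(b + 12);

    for (int i = 0; i < 4; ++i) {
        // Row i of the result is a linear combination of the rows of b,
        // weighted by the four elements of row i of a.
        __m128 r = _mm_mul_ps(_mm_set1_ps(a[4 * i + 0]), b0);
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a[4 * i + 1]), b1));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a[4 * i + 2]), b2));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a[4 * i + 3]), b3));
        _mm_storeu_ps(out + 4 * i, r);
    }
}
```

Because the routine contains no inline asm, register allocation and scheduling are left to the compiler, and the function can be inlined freely without the duplicate-label problem described above.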

By: LiraNuna

LiraNuna — Sat, 29 Aug 2009 20:26:55 +0000

non: What GCC version are you talking about? I agree GCC 3.4.x (MinGW’s version) is truly horrible when it comes to register allocation, but that was revised twice: in GCC 4.0 (SSA trees) and in 4.4 with the new register allocator (called IRA), which produces code that IMO looks like hand-coded assembly.