Lately I have been playing a lot with SSE optimizations and I really enjoy it so far. Using functions to tell the compiler which instructions to use makes you feel the power at your fingertips. At first I was naive and thought the compiler would do exactly what it was told, assuming that you know what you're doing. The SSE intrinsic header file turned out to be mostly a bunch of calls to internal GCC functions, or 'extern' declarations in MSVC, which suggested that the compiler would simply follow your lead.
I assumed wrong. The compiler will take the liberty of optimizing your code even further, at points you wouldn't even think about, though I have noticed that this is not always the case with MSVC. MSVC will sometimes trust the coder too much, even when optimizations could obviously be made. After grasping the concept of SSE and what it could do, I quickly realized that MSVC won't optimize as well as GCC 4.x or ICC would.
I have read many forum threads from people who want to gain speed by using SSE to optimize their core math operations, such as a 4D vector or a 4×4 matrix. While SSE can notably boost performance, by about 10-30% depending on usage, there is no magic switch that tells the compiler to rewrite your code to use SSE for you. You need to know how to use the intrinsics, optimize as you go, and carefully examine the resulting assembly code.
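To make "using intrinsics" concrete, here is a minimal sketch of a 4D vector addition. The `Vec4` type and the `vec4_add` function are hypothetical names made up for this illustration, and the sketch assumes an x86 target with SSE enabled and a compiler that supports `alignas`. The intrinsics themselves (`_mm_load_ps`, `_mm_add_ps`, `_mm_store_ps`) come from `xmmintrin.h`.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps

// Hypothetical 4D vector type for illustration.
// The 16-byte alignment is what allows the aligned load/store intrinsics below.
struct alignas(16) Vec4 {
    float v[4];
};

// Adds two vectors with one packed SSE addition instead of four scalar additions.
inline Vec4 vec4_add(const Vec4& a, const Vec4& b) {
    __m128 va = _mm_load_ps(a.v);            // aligned load of four floats from a
    __m128 vb = _mm_load_ps(b.v);            // aligned load of four floats from b
    Vec4 result;
    _mm_store_ps(result.v, _mm_add_ps(va, vb));  // packed add, then aligned store
    return result;
}
```

To see what the compiler actually made of it, you can ask for an assembly listing, for example with `-S` on GCC or `/FA` on MSVC, and check whether the loads and stores you wrote are still there.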
This article closely inspects and analyzes the assembly output of three major compilers: GCC 4.x targeting Linux (4.3.3 specifically), the latest stable MSVC 2008 (version 9.0.30729.1 SP1) and ICC 11.1.