<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: SSE intrinsics optimizations in popular compilers</title>
	<atom:link href="http://www.liranuna.com/sse-intrinsics-optimizations-in-popular-compilers/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.liranuna.com/sse-intrinsics-optimizations-in-popular-compilers/</link>
	<description>Just another coder</description>
	<lastBuildDate>Mon, 23 Aug 2010 19:28:29 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
	<item>
		<title>By: Michael</title>
		<link>http://www.liranuna.com/sse-intrinsics-optimizations-in-popular-compilers/comment-page-1/#comment-6704</link>
		<dc:creator>Michael</dc:creator>
		<pubDate>Wed, 17 Mar 2010 11:28:04 +0000</pubDate>
		<guid isPermaLink="false">http://www.liranuna.com/?p=984#comment-6704</guid>
		<description>&lt;blockquote&gt;I keep hearing the catch-phrase among programmers that “the compiler is better than you [think].”&lt;/blockquote&gt;

It&#039;s such a silly statement. Perhaps it is true for those who say it ...

That you even have to use intrinsics in the first place is a pretty good indicator that compilers still have a long, long way to go.

Interesting article, and good to see GCC is kicking bottom here. I don&#039;t think it&#039;s that they know the CPU better than Intel does; they can just &#039;afford&#039; more resources, and can share implementation with other CPUs like POWER or Cell&#039;s SPU.

I&#039;m kind of surprised SSE doesn&#039;t speed things up more, though. Is it just that the SISD code runs so fast, or that the SIMD unit isn&#039;t that fast (compared to, say, Cell/SPU)?</description>
		<content:encoded><![CDATA[<blockquote><p>I keep hearing the catch-phrase among programmers that “the compiler is better than you [think].”</p></blockquote>
<p>It&#8217;s such a silly statement. Perhaps it is true for those who say it &#8230;</p>
<p>That you even have to use intrinsics in the first place is a pretty good indicator that compilers still have a long, long way to go.</p>
<p>Interesting article, and good to see GCC is kicking bottom here. I don&#8217;t think it&#8217;s that they know the CPU better than Intel does; they can just &#8216;afford&#8217; more resources, and can share implementation with other CPUs like POWER or Cell&#8217;s SPU.</p>
<p>I&#8217;m kind of surprised SSE doesn&#8217;t speed things up more, though. Is it just that the SISD code runs so fast, or that the SIMD unit isn&#8217;t that fast (compared to, say, Cell/SPU)?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: LiraNuna</title>
		<link>http://www.liranuna.com/sse-intrinsics-optimizations-in-popular-compilers/comment-page-1/#comment-6667</link>
		<dc:creator>LiraNuna</dc:creator>
		<pubDate>Wed, 16 Dec 2009 04:20:20 +0000</pubDate>
		<guid isPermaLink="false">http://www.liranuna.com/?p=984#comment-6667</guid>
		<description>If you&#039;re getting a lot of cvtss2sd, that means you&#039;re using the double-typed math functions, such as sin instead of sinf, and GCC is doing what you asked of it: sin takes a double and returns a double, so GCC can&#039;t avoid that conversion (except when using -ffast-math).

If your target is x86, try using -mfpmath=sse,387 for single-scalar operations.

On an additional note, llvm-gcc seems to &lt;strong&gt;pass all tests&lt;/strong&gt; except comparison prediction. 

Extra note, GCC 4.5 will have plugin support. I plan to write a plugin that will enhance the SSE generation code (mainly swizzle merging and branch prediction with SSE vectors).</description>
		<content:encoded><![CDATA[<p>If you&#8217;re getting a lot of cvtss2sd, that means you&#8217;re using the double-typed math functions, such as sin instead of sinf, and GCC is doing what you asked of it: sin takes a double and returns a double, so GCC can&#8217;t avoid that conversion (except when using -ffast-math).</p>
<p>If your target is x86, try using -mfpmath=sse,387 for single-scalar operations.</p>
<p>On an additional note, llvm-gcc seems to <strong>pass all tests</strong> except comparison prediction. </p>
<p>Extra note, GCC 4.5 will have plugin support. I plan to write a plugin that will enhance the SSE generation code (mainly swizzle merging and branch prediction with SSE vectors).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Xo Wang</title>
		<link>http://www.liranuna.com/sse-intrinsics-optimizations-in-popular-compilers/comment-page-1/#comment-6666</link>
		<dc:creator>Xo Wang</dc:creator>
		<pubDate>Wed, 16 Dec 2009 02:41:17 +0000</pubDate>
		<guid isPermaLink="false">http://www.liranuna.com/?p=984#comment-6666</guid>
		<description>Oh, one gripe I do have is with the math code generated by GCC in -mfpmath=sse mode. It does &lt;em&gt;tons&lt;/em&gt; of cvtss2sd and cvtsd2ss and moves between xmm registers, the stack, and x87 even when I&#039;m only using single-precision floats. With all the great compile-time evaluation and register allocation it has, I can&#039;t believe there is so much inefficiency in plain math code.</description>
		<content:encoded><![CDATA[<p>Oh, one gripe I do have is with the math code generated by GCC in -mfpmath=sse mode. It does <em>tons</em> of cvtss2sd and cvtsd2ss and moves between xmm registers, the stack, and x87 even when I&#8217;m only using single-precision floats. With all the great compile-time evaluation and register allocation it has, I can&#8217;t believe there is so much inefficiency in plain math code.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Xo Wang</title>
		<link>http://www.liranuna.com/sse-intrinsics-optimizations-in-popular-compilers/comment-page-1/#comment-6665</link>
		<dc:creator>Xo Wang</dc:creator>
		<pubDate>Wed, 16 Dec 2009 02:31:39 +0000</pubDate>
		<guid isPermaLink="false">http://www.liranuna.com/?p=984#comment-6665</guid>
		<description>I have to agree that GCC (I&#039;m using 4.4.1-tdm-2 and &lt;a href=&quot;http://www.cadforte.com/system64.html&quot; rel=&quot;nofollow&quot;&gt;WPG 4.5.0&lt;/a&gt;) does a wonderful job turning intrinsic code into assembly. I converted an inline asm 4x4 matrix multiply routine into intrinsics and noticed that the output was nearly identical to my handcoded original, including instruction pairing/interleaving, with the exception of using different xmm registers.

In fact, it actually became more efficient because GCC inline asm required explicit load/unload into registers (loop counters, addresses, etc.) to be passed into the inline asm block, while the intrinsics-generated code used the registers from the preceding code.

Finally, if you have labels in inline assembly, the block can&#039;t be inside an inline function, since the label might end up appearing twice in the same asm file. My previous solution was to jump by a manually-calculated offset to where the label would be---very tedious.

Basically speaking, GCC 4.4 has made it easier and more efficient to code easy-to-understand, somewhat portable vector routines with intrinsics than to write them in straight C/C++ and pray that the vectorizer picks up the loop (which is something they should work on now).</description>
		<content:encoded><![CDATA[<p>I have to agree that GCC (I&#8217;m using 4.4.1-tdm-2 and <a href="http://www.cadforte.com/system64.html" rel="nofollow">WPG 4.5.0</a>) does a wonderful job turning intrinsic code into assembly. I converted an inline asm 4&#215;4 matrix multiply routine into intrinsics and noticed that the output was nearly identical to my handcoded original, including instruction pairing/interleaving, with the exception of using different xmm registers.</p>
<p>In fact, it actually became more efficient because GCC inline asm required explicit load/unload into registers (loop counters, addresses, etc.) to be passed into the inline asm block, while the intrinsics-generated code used the registers from the preceding code.</p>
<p>Finally, if you have labels in inline assembly, the block can&#8217;t be inside an inline function, since the label might end up appearing twice in the same asm file. My previous solution was to jump by a manually-calculated offset to where the label would be&#8212;very tedious.</p>
<p>Basically speaking, GCC 4.4 has made it easier and more efficient to code easy-to-understand, somewhat portable vector routines with intrinsics than to write them in straight C/C++ and pray that the vectorizer picks up the loop (which is something they should work on now).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: LiraNuna</title>
		<link>http://www.liranuna.com/sse-intrinsics-optimizations-in-popular-compilers/comment-page-1/#comment-6610</link>
		<dc:creator>LiraNuna</dc:creator>
		<pubDate>Sat, 29 Aug 2009 20:26:55 +0000</pubDate>
		<guid isPermaLink="false">http://www.liranuna.com/?p=984#comment-6610</guid>
		<description>non: What GCC version are you talking about? I agree GCC 3.4.x (MinGW&#039;s version) is truly horrible when it comes to register allocation, but that was revised twice: in GCC 4.0 (SSA trees) and in 4.4 with the new register allocator (called IRA), which produces code that IMO looks like hand-coded assembly.</description>
		<content:encoded><![CDATA[<p>non: What GCC version are you talking about? I agree GCC 3.4.x (MinGW&#8217;s version) is truly horrible when it comes to register allocation, but that was revised twice: in GCC 4.0 (SSA trees) and in 4.4 with the new register allocator (called IRA), which produces code that IMO looks like hand-coded assembly.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: non</title>
		<link>http://www.liranuna.com/sse-intrinsics-optimizations-in-popular-compilers/comment-page-1/#comment-6609</link>
		<dc:creator>non</dc:creator>
		<pubDate>Fri, 28 Aug 2009 13:12:55 +0000</pubDate>
		<guid isPermaLink="false">http://www.liranuna.com/?p=984#comment-6609</guid>
		<description>Actually, GCC&#039;s handling of SSE intrinsics completely sucks when it comes to register utilisation. For a very good example, try comparing the GCC and ICC output for John the Ripper&#039;s SSE implementation of MD5.

GCC is incapable of handling it correctly, constantly moving data to and from the stack. ICC&#039;s code is up to 4 times faster!</description>
		<content:encoded><![CDATA[<p>Actually, GCC&#8217;s handling of SSE intrinsics completely sucks when it comes to register utilisation. For a very good example, try comparing the GCC and ICC output for John the Ripper&#8217;s SSE implementation of MD5.</p>
<p>GCC is incapable of handling it correctly, constantly moving data to and from the stack. ICC&#8217;s code is up to 4 times faster!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ben</title>
		<link>http://www.liranuna.com/sse-intrinsics-optimizations-in-popular-compilers/comment-page-1/#comment-6601</link>
		<dc:creator>Ben</dc:creator>
		<pubDate>Sat, 08 Aug 2009 20:32:48 +0000</pubDate>
		<guid isPermaLink="false">http://www.liranuna.com/?p=984#comment-6601</guid>
		<description>Thanks! I&#039;m just starting to look into using these instructions and this was a great read.</description>
		<content:encoded><![CDATA[<p>Thanks! I&#8217;m just starting to look into using these instructions and this was a great read.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
