CodeMirror is a real-time code editor for your browser. I know that diff isn’t a format humans usually edit by hand, but I found myself needing diff syntax highlighting alongside the other code I show.
More on that soon.
While PHP offers some basic functions to handle paths, such as basename and dirname to resolve the base name and (direct) parent of a path, it offers no means of normalizing or combining paths that live on a remote file system outside the server’s reach. For local files, it offers the function realpath.
I didn’t like that situation and decided to write a ‘static’ utility class to handle file paths safely, without worrying about possible path masquerading caused by broken code.
I hope someone will find the result useful:
<?php
/**
 * @class Path
 *
 * @brief Utility class that handles file and directory paths
 *
 * This class handles basic important operations done to file system paths.
 * It safely renders relative paths and removes all ambiguity from a relative path.
 *
 * @author Liran Nuna
 */
final class Path
{
    /**
     * Returns the parent path of this path.
     * "/path/to/directory" will return "/path/to"
     *
     * @arg $path The path to retrieve the parent path from
     */
    public static function dirname($path)
    {
        return dirname(self::normalize($path));
    }

    /**
     * Returns the last item on the path.
     * "/path/to/directory" will return "directory"
     *
     * @arg $path The path to retrieve the base from
     */
    public static function basename($path)
    {
        return basename(self::normalize($path));
    }

    /**
     * Normalizes the path for safe usage
     * This function does several operations to the given path:
     * * Removes unnecessary slashes (///path//to/////directory////)
     * * Removes current directory references (/path/././to/./directory/./././)
     * * Renders relative paths (/path/from/../to/somewhere/in/../../directory)
     *
     * @arg $path The path to normalize
     */
    public static function normalize($path)
    {
        // A closure is used here because create_function() was removed in PHP 8.0
        return array_reduce(explode('/', $path), function ($a, $b) {
            if ($a === 0) $a = "/";
            if ($b === "" || $b === ".") return $a;
            if ($b === "..") return dirname($a);
            return preg_replace("/\/+/", "/", "$a/$b");
        }, 0);
    }

    /**
     * Combines a list of paths to one safe path
     *
     * @arg $root The path or array with values to combine into a single path
     * @arg ... Relative paths to root or arrays
     *
     * @note This function works with multi-dimensional arrays recursively.
     */
    public static function combine($root, $rel1)
    {
        $arguments = func_get_args();

        return self::normalize(array_reduce($arguments, function ($a, $b) {
            if (is_array($a)) $a = array_reduce($a, "Path::combine");
            if (is_array($b)) $b = array_reduce($b, "Path::combine");
            return "$a/$b";
        }));
    }

    /**
     * Empty, private constructor, to prevent instantiation
     */
    private function __construct()
    {
        // Prevents instantiation
    }
}
Usage of this class is very simple. Path::basename and Path::dirname perform the same operations as PHP’s native basename and dirname, but more safely:
<?php
// PHP's native basename will return '..'
echo basename('/path/to/treasure/island/monster/../..') . "\n";

// Safe basename will return 'treasure'
echo Path::basename('/path/to/treasure/island/monster/../..') . "\n";

// PHP's native dirname will return '/path/to/treasure/island/monster/..'
echo dirname('/path/to/treasure/island/monster/../..') . "\n";

// Safe dirname will return '/path/to'
echo Path::dirname('/path/to/treasure/island/monster/../..') . "\n";
Path::normalize will sanitize paths and return the safe real path even if it does not exist on the server:
<?php
// Normalize will 'sanitize' this path
// Result: '/path/to/candy/up/ahead/please/go/right'
echo Path::normalize(
    '///../path//to/./monster/././/' .
    '//../candy/.//./up/ahead/.//./' .
    'test//back/../..//please/go///' .
    '/left/./../right/123_test!/../'
) . "\n";
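For readers who want the same idea outside PHP, here is a minimal C++ sketch of the normalization rules described above. It is not part of the Path class: the function name normalizePath is made up for illustration, and it assumes, like the PHP version, that every path is treated as rooted at "/".

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical C++ counterpart of Path::normalize (not part of the PHP class).
// Every input is treated as rooted at "/", as the PHP version does.
std::string normalizePath(const std::string &path)
{
    std::vector<std::string> parts;   // Components kept so far
    std::istringstream stream(path);
    std::string segment;

    while (std::getline(stream, segment, '/')) {
        if (segment.empty() || segment == ".") {
            continue;                  // Skips "//" and "/./"
        }
        if (segment == "..") {
            if (!parts.empty()) {
                parts.pop_back();      // Steps up, but never above the root
            }
            continue;
        }
        parts.push_back(segment);
    }

    std::string result;
    for (const std::string &part : parts) {
        result += "/" + part;
    }
    return result.empty() ? "/" : result;
}
```

Calling normalizePath("/path/to/treasure/island/monster/../..") gives "/path/to/treasure", the same directory the safe basename and dirname examples work from.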
Lastly, Path::combine will combine paths from a variable number of strings and arrays to form one safe path:
<?php
// Combine paths from a relative path and root
// Result: '/var/www/www.site.com/index.html'
echo Path::combine(
    '/var/www/www.site.com/',
    'img/../css/jqueryui/../../index.html'
) . "\n";

// Combine will also take values from arrays
// Result: '/path/to/directory/sub/TEST/test/lastDirectory/filename.ext'
echo Path::combine(
    array(
        "/path/to",
        "folder/../directory"
    ),
    'sub',
    array(
        array(
            array(
                'TEST',
                'test',
            )
        ),
        'lastDirectory',
    ),
    'filename.ext'
) . "\n";
As always, code I post is under the WTFPL, so you can use it without any obligations.
libellen also received an official svn repository, incorporating this patch. Get the latest libellen sources from svn using:
svn co http://svn.liranuna.com/libellen/trunk ellen
Current revision is 4, so this release is named libellen r4.
I assumed wrong – the compiler will take the liberty of optimizing your code even further, at points you wouldn’t even think about, though I have noticed that this is not always the case with MSVC. MSVC will sometimes trust the coder too much, even when optimizations obviously could be made. After grasping the concept of SSE and what it could do, I quickly realized MSVC won’t optimize as well as GCC 4.x or ICC would.
I read a lot of forum posts by people who want to gain speed by using SSE to optimize their core math operations, such as a 4D vector or a 4×4 matrix. While SSE can notably boost performance, by about 10-30% depending on usage, there is no magic switch that tells the compiler to use SSE for you. You need to know how to use intrinsics, optimize along the way, and carefully examine the resulting assembly code.
This article will closely inspect and analyze the assembly output of 3 major compilers – GCC 4.x targeting Linux (specifically 4.3.3), the latest stable MSVC 2008 (version 9.0.30729.1 SP1) and ICC 11.1.
I’ll start by declaring the options I give for each compiler – I am keeping it minimal and simple yet enough to output sane and optimized code.
GCC command line:
gcc -O2 -msse test.c -S -o test.asm
MSVC command line:
cl /O2 /arch:SSE /c /FA test.c
ICC’s command line:
icc -O2 -msse test.c -S -o test.asm
MSVC automatically generates a file called test.asm, so no need to specify output file. Regardless of that, note the remarkable resemblance of the commands…
Let’s begin with a simple assignment test:
#include <xmmintrin.h>

extern void printv(__m128 m);

int main()
{
    __m128 m = _mm_set_ps(4, 3, 2, 1);
    __m128 z = _mm_setzero_ps();

    printv(m);
    printv(z);

    return 0;
}
This will assign m to be [1, 2, 3, 4] and z to be [0, 0, 0, 0]. Please note that the undefined “extern” function ‘printv’ forces the compilers to keep the variables instead of optimizing them out, “proving” that they are used. Since we only compile to assembly with all three compilers and never link, there is no need to actually define printv.
The variable m is actually const, but we didn’t hint that to the compiler. The compiler should understand that ‘m’ does not change and move it into a read-only (const) data section. The zero vector should use the xorps opcode to generate a fast zero vector without spending const memory (x XOR x is always 0).
The output:
MSVC: movss xmm2, DWORD PTR __real@40400000 ; 3.0f movss xmm3, DWORD PTR __real@40800000 ; 4.0f movss xmm0, DWORD PTR __real@3f800000 ; 1.0f movss xmm1, DWORD PTR __real@40000000 ; 2.0f unpcklps xmm0, xmm2 unpcklps xmm1, xmm3 unpcklps xmm0, xmm1 call _printv xorps xmm0, xmm0 call _printv GCC: movaps .LC0, %xmm0 call printv xorps %xmm0, %xmm0 call printv .LC0: .long 1065353216 ; 1.0f .long 1073741824 ; 2.0f .long 1077936128 ; 3.0f .long 1082130432 ; 4.0f ICC: movaps _2il0floatpacket.0, %xmm0 call printv xorps %xmm0, %xmm0 call printv _2il0floatpacket.0: .long 0x3f800000,0x40000000,0x40400000,0x40800000
Both GCC and ICC understood that the variable ‘m’ is const, and moved it to a read-only (const) data section. MSVC, however, chose to use 4 xmm registers to create ‘m’. Not only does it write to valuable registers that are crucial to have in a real application, it also forces the use of the stack if those registers already contain information, which is common when inlining. It will also invalidate cache usage, since the data is in the opcode, effectively eliminating future prefetches. All compilers, however, used xorps to create a zero vector, which is pleasing to see.
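A small aid for reading the listings: GCC prints its vector constants as decimal .long values, while MSVC and ICC show the same bits in hex. The sketch below converts between a 32-bit pattern and the float it encodes. The helper names are made up for this example, and it assumes the usual 32-bit IEEE-754 float.

```cpp
#include <cstdint>
#include <cstring>

// Reinterprets a 32-bit pattern (as printed after ".long") as a float.
// Assumes float is a 32-bit IEEE-754 type, which the static_assert partly checks.
float floatFromBits(std::uint32_t bits)
{
    float value;
    static_assert(sizeof value == sizeof bits, "float must be 32 bits wide");
    std::memcpy(&value, &bits, sizeof value);  // Bit copy without aliasing issues
    return value;
}

// Inverse helper: the bit pattern of a float.
std::uint32_t bitsFromFloat(float value)
{
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}
```

With these, floatFromBits(1065353216u) is 1.0f (0x3f800000 in hex), and bitsFromFloat(-0.0f) is 0x80000000, the sign-bit mask that shows up in the later tests.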
The next test is arithmetic prediction. It checks how the compiler deals with SSE operations on predefined values, much as it does with integer operations. The compiler should predict and precompute operations such as ‘1+1’ and use the answer directly instead of making the CPU compute a static answer. The test is as follows:
#include <xmmintrin.h>

extern void printv(__m128 m);

int main()
{
    __m128 m = _mm_set_ps(-4, -3, -2, -1);
    __m128 one = _mm_set1_ps(1.0f);

    printv(_mm_and_ps(m, _mm_setzero_ps()));    // Always a zero vector
    printv(_mm_or_ps(m, _mm_set1_ps(-0.0f)));   // Set the sign bit (nop, all negative)
    printv(_mm_add_ps(m, _mm_setzero_ps()));    // Add 0 (nop; x+0=x)
    printv(_mm_sub_ps(m, _mm_setzero_ps()));    // Subtract 0 (nop; x-0=x)
    printv(_mm_mul_ps(m, one));                 // Multiply by one (nop)
    printv(_mm_div_ps(m, one));                 // Division by one (nop)

    return 0;
}
On the first test, the compiler should always send a zeroed xmm register to printv, since x & 0 is always equal to 0. The rest of the tests should always result in sending the same register, since each of them is simply a way to create a nop (no operation).
The results:
MSVC: movss xmm0, DWORD PTR __real@c0800000 movss xmm2, DWORD PTR __real@c0400000 movss xmm3, DWORD PTR __real@c0000000 movss xmm1, DWORD PTR __real@bf800000 unpcklps xmm3, xmm0 xorps xmm0, xmm0 unpcklps xmm1, xmm2 unpcklps xmm1, xmm3 movaps XMMWORD PTR tv129[esp+32], xmm0 movaps XMMWORD PTR _m$[esp+32], xmm1 andps xmm0, xmm1 call _printv movss xmm0, DWORD PTR __real@80000000 shufps xmm0, xmm0, 0 orps xmm0, XMMWORD PTR _m$[esp+32] call _printv movaps xmm0, XMMWORD PTR tv129[esp+32] addps xmm0, XMMWORD PTR _m$[esp+32] call _printv movaps xmm0, XMMWORD PTR _m$[esp+32] subps xmm0, XMMWORD PTR tv129[esp+32] call _printv GCC: xorps %xmm0, %xmm0 call printv movaps .LC0(%rip), %xmm0 call printv movaps .LC0(%rip), %xmm0 call printv movaps .LC0(%rip), %xmm0 call printv .LC0: .long 3212836864 .long 3221225472 .long 3225419776 .long 3229614080 ICC: xorps %xmm0, %xmm0 call printv movaps _2il0floatpacket.2, %xmm0 orps _2il0floatpacket.0, %xmm0 call printv movaps _2il0floatpacket.0, %xmm0 call printv movaps _2il0floatpacket.0, %xmm0 call printv movaps _2il0floatpacket.0, %xmm0 mulps _2il0floatpacket.1, %xmm0 call printv movaps _2il0floatpacket.0, %xmm0 divps _2il0floatpacket.1, %xmm0 call printv _2il0floatpacket.0: .long 0xbf800000,0xc0000000,0xc0400000,0xc0800000 _2il0floatpacket.1: .long 0x3f800000,0x3f800000,0x3f800000,0x3f800000 _2il0floatpacket.2: .long 0x80000000,0x80000000,0x80000000,0x80000000
The results are certainly interesting. MSVC decided not to optimize the code and did exactly what it was told, resulting in redundant code. More should be noted: xorps (line 7) could’ve been moved after the unpcklps instruction (line 10) to take advantage of instruction pairing (when the processor executes the same opcode again, it’s usually faster, especially in SSE-land, where the CPU operates on large 128-bit registers). GCC’s code does exactly what we expect from a modern compiler; it evaluates all of the operations statically. ICC seems to be selective about what it can determine, leaving the redundant OR, multiplication and division in place while optimizing the others out.
The next test is about shuffles. There will be several tests of redundant shuffles that could easily be optimized out or merged, such as double reverses and subsequent shuffles.
#include <xmmintrin.h>

extern void printv(__m128 m);

int main()
{
    __m128 m = _mm_set_ps(4, 3, 2, 1);

    m = _mm_shuffle_ps(m, m, 0xE4); // NOP - shuffles to same order
    printv(m);

    m = _mm_shuffle_ps(m, m, 0x1B); // Reverses the vector
    m = _mm_shuffle_ps(m, m, 0x1B); // Reverses the vector again, NOP
    printv(m);

    m = _mm_shuffle_ps(m, m, 0x1B); // Reverses the vector
    m = _mm_shuffle_ps(m, m, 0x1B); // Reverses the vector again, NOP
    m = _mm_shuffle_ps(m, m, 0x1B); // All should be optimized to one shuffle
    printv(m);

    m = _mm_shuffle_ps(m, m, 0xC9); // Those two shuffles together swap pairs
    m = _mm_shuffle_ps(m, m, 0x2D); // And could be optimized to 0x4E
    printv(m);

    m = _mm_shuffle_ps(m, m, 0x55); // Broadcasts a single element (index 1) to all four
    m = _mm_shuffle_ps(m, m, 0x55); // Redundant - since all are the same
    m = _mm_shuffle_ps(m, m, 0x55); // Let's stress it again
    m = _mm_shuffle_ps(m, m, 0x55); // And one last time
    printv(m);

    return 0;
}
The results here should contain the minimum number of shuffles. The first two tests should have no shuffle at all. The third should only have one shuffle to reverse the vector (mask = 0x1B). The fourth test should merge the two shuffles into one, from masks 0xC9 and 0x2D to mask 0x4E (swap pairs). The last test should be optimized to only one shuffle, since all of them end up selecting the same value.
MSVC: movss xmm1, DWORD PTR __real@40800000 ; 4.0f movss xmm2, DWORD PTR __real@40400000 ; 3.0f movss xmm3, DWORD PTR __real@40000000 ; 2.0f movss xmm0, DWORD PTR __real@3f800000 ; 1.0f unpcklps xmm3, xmm1 unpcklps xmm0, xmm2 unpcklps xmm0, xmm3 shufps xmm0, xmm0, 228 ; 000000e4H movaps XMMWORD PTR _m$[esp+16], xmm0 call _printv movaps xmm0, XMMWORD PTR _m$[esp+16] shufps xmm0, xmm0, 27 ; 0000001bH shufps xmm0, xmm0, 27 ; 0000001bH movaps XMMWORD PTR _m$[esp+16], xmm0 call _printv movaps xmm0, XMMWORD PTR _m$[esp+16] shufps xmm0, xmm0, 27 ; 0000001bH shufps xmm0, xmm0, 27 ; 0000001bH shufps xmm0, xmm0, 27 ; 0000001bH movaps XMMWORD PTR _m$[esp+16], xmm0 call _printv movaps xmm0, XMMWORD PTR _m$[esp+16] shufps xmm0, xmm0, 201 ; 000000c9H shufps xmm0, xmm0, 45 ; 0000002dH movaps XMMWORD PTR _m$[esp+16], xmm0 call _printv movaps xmm0, XMMWORD PTR _m$[esp+16] shufps xmm0, xmm0, 85 ; 00000055H shufps xmm0, xmm0, 85 ; 00000055H shufps xmm0, xmm0, 85 ; 00000055H shufps xmm0, xmm0, 85 ; 00000055H call _printv GCC: movaps .LC0, %xmm1 movaps %xmm1, %xmm0 movaps %xmm1, -24(%ebp) call printv movaps -24(%ebp), %xmm1 movaps %xmm1, %xmm0 call printv movaps .LC1, %xmm0 call printv movaps .LC1, %xmm1 shufps $201, %xmm1, %xmm1 shufps $45, %xmm1, %xmm1 movaps %xmm1, %xmm0 movaps %xmm1, -24(%ebp) call printv movaps -24(%ebp), %xmm1 movaps %xmm1, %xmm0 shufps $85, %xmm1, %xmm0 call printv .LC0: .long 1065353216 ; 1.0f .long 1073741824 ; 2.0f .long 1077936128 ; 3.0f .long 1082130432 ; 4.0f .LC1: .long 1082130432 ; 4.0f .long 1077936128 ; 3.0f .long 1073741824 ; 2.0f .long 1065353216 ; 1.0f ICC: movaps _2il0floatpacket.0, %xmm0 addl $4, %esp shufps $228, %xmm0, %xmm0 movaps %xmm0, (%esp) call printv movaps (%esp), %xmm0 shufps $27, %xmm0, %xmm0 shufps $27, %xmm0, %xmm0 movaps %xmm0, (%esp) call printv movaps (%esp), %xmm0 shufps $27, %xmm0, %xmm0 shufps $27, %xmm0, %xmm0 shufps $27, %xmm0, %xmm0 movaps %xmm0, (%esp) call printv movaps (%esp), %xmm0 shufps $201, %xmm0, %xmm0 shufps $45, 
%xmm0, %xmm0 movaps %xmm0, (%esp) call printv movaps (%esp), %xmm0 shufps $85, %xmm0, %xmm0 shufps $85, %xmm0, %xmm0 shufps $85, %xmm0, %xmm0 shufps $85, %xmm0, %xmm0 call printv _2il0floatpacket.0: .long 0x3f800000,0x40000000,0x40400000,0x40800000
The results are interesting and quite surprising – GCC passed all of the tests but the shuffle merge, while MSVC and ICC didn’t optimize any of the shuffles. Shame. It is interesting to note that GCC chose to load the precomputed reversed vector from memory again for the later operations instead of caching it in an extra xmm register. (Copying registers is faster than copying memory, even if it’s aligned.)
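To see why the 0xC9 and 0x2D pair really is a single 0x4E shuffle, the masks can be composed by hand. The sketch below is a portable model, not something any of the compilers do: it only covers the case used in the test, where both operands of _mm_shuffle_ps are the same vector, and the names Lanes, shuffleLanes and composeMasks are invented for this example.

```cpp
#include <array>
#include <cstdint>

using Lanes = std::array<int, 4>;

// Models _mm_shuffle_ps(v, v, mask): lane i of the result takes the source
// lane selected by the two mask bits at position 2*i.
Lanes shuffleLanes(const Lanes &v, std::uint8_t mask)
{
    Lanes r{};
    for (int i = 0; i < 4; ++i) {
        r[i] = v[(mask >> (2 * i)) & 3];
    }
    return r;
}

// Returns the single mask equivalent to shuffling with `first`, then `second`.
std::uint8_t composeMasks(std::uint8_t first, std::uint8_t second)
{
    std::uint8_t merged = 0;
    for (int i = 0; i < 4; ++i) {
        int viaSecond = (second >> (2 * i)) & 3;         // Lane read by the 2nd shuffle
        int viaFirst  = (first >> (2 * viaSecond)) & 3;  // Where that lane came from
        merged |= static_cast<std::uint8_t>(viaFirst << (2 * i));
    }
    return merged;
}
```

composeMasks(0xC9, 0x2D) comes out as 0x4E, two reverses (0x1B twice) come out as the identity mask 0xE4, and 0x55 composed with itself stays 0x55, which matches the expectations listed for the test.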
All of the previous tests were about the compiler being able to make decisions about static data. Now it’s time for functions, where input and output aren’t known. A notable example of a vector operation is normalization. Here is a function that returns a normalized copy of an SSE vector.
__m128 normalize(__m128 m)
{
    __m128 l = _mm_mul_ps(m, m);
    l = _mm_add_ps(l, _mm_shuffle_ps(l, l, 0x4E));
    return _mm_div_ps(m, _mm_sqrt_ps(_mm_add_ps(l, _mm_shuffle_ps(l, l, 0x11))));
}
The function is heavily optimized. It hints to the compiler what should be a temporary variable and what should be reused, and takes a total of 7 operations.
The results we expect are a perfect projection of the SSE intrinsics to assembly using only 3 vectors (original, length and square):
MSVC: normalize: movaps xmm2, xmm0 mulps xmm2, xmm0 movaps xmm1, xmm2 shufps xmm1, xmm2, 78 ; 0000004eH addps xmm1, xmm2 movaps xmm2, xmm1 shufps xmm2, xmm1, 17 ; 00000011H addps xmm2, xmm1 sqrtps xmm1, xmm2 divps xmm0, xmm1 ret GCC: normalize: movaps %xmm0, %xmm2 mulps %xmm0, %xmm2 movaps %xmm2, %xmm1 shufps $78, %xmm2, %xmm1 addps %xmm2, %xmm1 movaps %xmm1, %xmm2 shufps $17, %xmm1, %xmm2 addps %xmm2, %xmm1 sqrtps %xmm1, %xmm1 divps %xmm1, %xmm0 ret ICC: normalize: movaps %xmm0, %xmm3 mulps %xmm0, %xmm3 movaps %xmm3, %xmm1 shufps $78, %xmm3, %xmm1 addps %xmm1, %xmm3 movaps %xmm3, %xmm2 shufps $17, %xmm3, %xmm2 addps %xmm2, %xmm3 sqrtps %xmm3, %xmm4 divps %xmm4, %xmm0 ret
Good to see that all compilers are equal here. These are exactly the results we expected.
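If the two shuffles in normalize look like magic, the data flow can be followed with plain arrays. This is only a scalar model for reading the listings, with made-up helper names (Vec4, shuffleSame, addLanes, normalizeModel), and not a replacement for the intrinsic version:

```cpp
#include <array>
#include <cmath>

using Vec4 = std::array<float, 4>;

// Models _mm_shuffle_ps(v, v, mask) on a plain array.
Vec4 shuffleSame(const Vec4 &v, unsigned mask)
{
    return { v[mask & 3], v[(mask >> 2) & 3], v[(mask >> 4) & 3], v[(mask >> 6) & 3] };
}

// Models addps: lane-wise addition.
Vec4 addLanes(const Vec4 &a, const Vec4 &b)
{
    return { a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3] };
}

// Follows the same 7 steps as the intrinsic version of normalize.
Vec4 normalizeModel(const Vec4 &m)
{
    Vec4 l = { m[0] * m[0], m[1] * m[1], m[2] * m[2], m[3] * m[3] };  // mulps
    l = addLanes(l, shuffleSame(l, 0x4E));  // [x2+z2, y2+w2, x2+z2, y2+w2]
    l = addLanes(l, shuffleSame(l, 0x11));  // Full sum of squares in every lane
    return { m[0] / std::sqrt(l[0]), m[1] / std::sqrt(l[1]),
             m[2] / std::sqrt(l[2]), m[3] / std::sqrt(l[3]) };        // sqrtps, divps
}
```

For the input [1, 2, 2, 4] the sum of squares is 25 in every lane, so the model returns approximately [0.2, 0.4, 0.4, 0.8].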
The next test combines function calls with static compile-time data through inline functions. An inline function should have its code embedded into the calling routine and be optimized for the specific case where possible. A classic case of an inline function is ‘abs’:
#include <xmmintrin.h>

extern void printv(__m128 m);

/* This is called _mm_abs_ps because 'abs' is a built in function
   and C does not allow overloading */
inline __m128 _mm_abs_ps(__m128 m)
{
    return _mm_andnot_ps(_mm_set1_ps(-0.0f), m);
}

int main()
{
    // All positive
    printv(_mm_abs_ps(_mm_set_ps(1.0f, 0.0f, 0.0f, 1.0f)));

    // All negative
    printv(_mm_abs_ps(_mm_set_ps(-1.0f, -0.0f, -0.0f, -1.0f)));

    // Mixed
    printv(_mm_abs_ps(_mm_set_ps(-1.0f, -0.0f, 0.0f, 1.0f)));
}
The results we expect are perfect inlining of the function, resulting in the same vector over the three calls. A good compiler will also not duplicate the data over the three calls and reuse the same vector for the program, since the linker will most likely not do it.
MSVC: main: movss xmm1, DWORD PTR __real@3f800000 xorps xmm2, xmm2 movss xmm0, DWORD PTR __real@80000000 movaps xmm3, xmm1 movaps xmm4, xmm2 shufps xmm0, xmm0, 0 movaps XMMWORD PTR tv166[esp+16], xmm0 unpcklps xmm2, xmm3 unpcklps xmm1, xmm4 unpcklps xmm1, xmm2 andnps xmm0, xmm1 call printv movss xmm2, DWORD PTR __real@bf800000 movss xmm1, DWORD PTR __real@80000000 movaps xmm0, XMMWORD PTR tv166[esp+16] movaps xmm3, xmm2 movaps xmm4, xmm1 unpcklps xmm1, xmm3 unpcklps xmm2, xmm4 unpcklps xmm2, xmm1 andnps xmm0, xmm2 call printv movss xmm1, DWORD PTR __real@bf800000 movss xmm2, DWORD PTR __real@80000000 xorps xmm3, xmm3 movss xmm4, DWORD PTR __real@3f800000 movaps xmm0, XMMWORD PTR tv166[esp+16] unpcklps xmm3, xmm1 unpcklps xmm4, xmm2 unpcklps xmm4, xmm3 andnps xmm0, xmm4 call printv GCC: main: movaps .LC1, %xmm0 call printv movaps .LC1, %xmm0 call printv movaps .LC1, %xmm0 call printv .LC0: .long 2147483648 ; -0.0f .long 2147483648 ; -0.0f .long 2147483648 ; -0.0f .long 2147483648 ; -0.0f .LC1: .long 1065353216 ; 1.0f .long 0 ; 0.0f .long 0 ; 0,0f .long 1065353216 ; 1.0f ICC: movaps _2il0floatpacket.7, %xmm0 addl $4, %esp movaps %xmm0, (%esp) andnps _2il0floatpacket.6, %xmm0 call printv movaps (%esp), %xmm0 andnps _2il0floatpacket.8, %xmm0 call printv movaps (%esp), %xmm0 andnps _2il0floatpacket.9, %xmm0 call printv _2il0floatpacket.6: .long 0x3f800000,0x00000000,0x00000000,0x3f800000 _2il0floatpacket.7: .long 0x80000000,0x80000000,0x80000000,0x80000000 _2il0floatpacket.8: .long 0xbf800000,0x80000000,0x80000000,0xbf800000 _2il0floatpacket.9: .long 0x3f800000,0x00000000,0x80000000,0xbf800000 _2il0floatpacket.11: .long 0x80000000,0x80000000,0x80000000,0x80000000
This time each compiler chose its own way of optimizing. MSVC’s horrible assignment code, in addition to its inability to predict static operations, resulted in redundant code. ICC inlined the function, but kept some of the static data on the stack (while it was comfortably available in aligned read-only space) and did not perform any precomputation. GCC optimizes the code as we expected, but it “forgot” to remove the unnecessary helper vector (LC0), which is not used. This isn’t a big deal though, because the linker will simply remove unreferenced const objects. GCC most likely kept it for when the inline function would have had use for it.
A good compiler should also predict branches and eliminate the unused code if the check is always true or false. SSE provides a way to compare 4 floats at once using the cmp*ps instructions. If the comparison is true for a component, the instruction sets all bits of that component to 1; if it is false, to 0. This instruction could be eliminated easily if the result is known at compile time, especially in inline functions. The test will implement the function ‘sign’, which returns 1, 0 or -1 per component.
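Why all three calls should collapse into the same constant is easy to check on the raw bits. The following is a portable sketch with invented names (Bits4, bitsOf, absBits) that models what andnps does to each lane, assuming 32-bit IEEE-754 floats:

```cpp
#include <array>
#include <cstdint>
#include <cstring>

using Bits4 = std::array<std::uint32_t, 4>;

// Bit pattern of a float, copied without violating aliasing rules.
std::uint32_t bitsOf(float value)
{
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}

// Models _mm_andnot_ps(_mm_set1_ps(-0.0f), m) lane by lane:
// andnps computes (~first) & second, and -0.0f has only the sign bit set,
// so every lane simply loses its sign bit.
Bits4 absBits(float a, float b, float c, float d)
{
    const std::uint32_t signMask = bitsOf(-0.0f);  // 0x80000000
    return { ~signMask & bitsOf(a), ~signMask & bitsOf(b),
             ~signMask & bitsOf(c), ~signMask & bitsOf(d) };
}
```

absBits(1.0f, 0.0f, 0.0f, 1.0f), absBits(-1.0f, -0.0f, -0.0f, -1.0f) and absBits(-1.0f, -0.0f, 0.0f, 1.0f) all yield the same four words, 0x3f800000, 0, 0, 0x3f800000, which is exactly GCC’s .LC1 vector.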
#include <xmmintrin.h>

extern void printv(__m128 m);

inline __m128 sign(__m128 m)
{
    return _mm_and_ps(
        _mm_or_ps(_mm_and_ps(m, _mm_set1_ps(-0.0f)), _mm_set1_ps(1.0f)),
        _mm_cmpneq_ps(m, _mm_setzero_ps()));
}

int main()
{
    __m128 m = _mm_setr_ps(1, -2, 3, -4);

    printv(_mm_cmpeq_ps(m, m));                 // Equal to itself
    printv(_mm_cmpgt_ps(m, _mm_setzero_ps()));  // Greater than zero
    printv(_mm_cmplt_ps(m, _mm_setzero_ps()));  // Less than zero

    printv(sign(_mm_setr_ps( 1,  2,  3,  4)));  // All 1's
    printv(sign(_mm_setr_ps(-1, -2, -3, -4)));  // All -1's
    printv(sign(_mm_setr_ps( 0,  0,  0,  0)));  // All 0's
    printv(sign(m));                            // Mixed
}
A good compiler will eliminate those checks and create a const copy of the result, especially where all comparisons fail, resulting in a zero vector. In the following listings, we check whether the compilers precompute several constant comparison results without sacrificing application size.
MSVC: movss xmm2, DWORD PTR __real@40400000 movss xmm3, DWORD PTR __real@c0800000 movss xmm0, DWORD PTR __real@3f800000 movss xmm1, DWORD PTR __real@c0000000 unpcklps xmm0, xmm2 unpcklps xmm1, xmm3 unpcklps xmm0, xmm1 movaps XMMWORD PTR _m$[esp+64], xmm0 cmpeqps xmm0, xmm0 call _printv xorps xmm0, xmm0 movaps XMMWORD PTR tv258[esp+64], xmm0 cmpltps xmm0, XMMWORD PTR _m$[esp+64] call _printv movaps xmm0, XMMWORD PTR _m$[esp+64] cmpltps xmm0, XMMWORD PTR tv258[esp+64] call _printv movss xmm2, DWORD PTR __real@3f800000 movss xmm3, DWORD PTR __real@40400000 movss xmm4, DWORD PTR __real@40800000 movss xmm0, DWORD PTR __real@40000000 movaps xmm1, xmm2 shufps xmm2, xmm2, 0 movaps XMMWORD PTR tv271[esp+64], xmm2 unpcklps xmm0, xmm4 unpcklps xmm1, xmm3 unpcklps xmm1, xmm0 movss xmm0, DWORD PTR __real@80000000 shufps xmm0, xmm0, 0 movaps XMMWORD PTR tv272[esp+64], xmm0 andps xmm0, xmm1 orps xmm0, xmm2 movaps xmm2, XMMWORD PTR tv258[esp+64] cmpneqps xmm2, xmm1 andps xmm0, xmm2 call _printv movss xmm2, DWORD PTR __real@c0400000 movss xmm3, DWORD PTR __real@c0800000 movss xmm1, DWORD PTR __real@bf800000 movss xmm0, DWORD PTR __real@c0000000 unpcklps xmm1, xmm2 movaps xmm2, XMMWORD PTR tv258[esp+64] unpcklps xmm0, xmm3 unpcklps xmm1, xmm0 movaps xmm0, XMMWORD PTR tv272[esp+64] andps xmm0, xmm1 orps xmm0, XMMWORD PTR tv271[esp+64] cmpneqps xmm2, xmm1 andps xmm0, xmm2 call _printv xorps xmm1, xmm1 movaps xmm0, xmm1 movaps xmm2, xmm1 movaps xmm3, xmm1 unpcklps xmm1, xmm2 movaps xmm2, XMMWORD PTR tv258[esp+64] unpcklps xmm0, xmm3 unpcklps xmm1, xmm0 movaps xmm0, XMMWORD PTR tv272[esp+64] andps xmm0, xmm1 orps xmm0, XMMWORD PTR tv271[esp+64] cmpneqps xmm2, xmm1 andps xmm0, xmm2 call _printv movaps xmm2, XMMWORD PTR _m$[esp+64] movaps xmm0, XMMWORD PTR tv272[esp+64] movaps xmm1, XMMWORD PTR tv258[esp+64] andps xmm0, xmm2 orps xmm0, XMMWORD PTR tv271[esp+64] cmpneqps xmm1, xmm2 andps xmm0, xmm1 call _printv GCC: movaps .LC2(%rip), %xmm0 movaps %xmm0, (%rsp) cmpeqps %xmm0, %xmm0 call 
printv xorps %xmm0, %xmm0 cmpltps (%rsp), %xmm0 call printv xorps %xmm1, %xmm1 movaps (%rsp), %xmm0 cmpltps %xmm1, %xmm0 call printv xorps %xmm0, %xmm0 cmpneqps .LC3(%rip), %xmm0 andps .LC1(%rip), %xmm0 call printv movaps .LC0(%rip), %xmm0 xorps %xmm1, %xmm1 orps .LC1(%rip), %xmm0 cmpneqps .LC4(%rip), %xmm1 andps %xmm1, %xmm0 call printv xorps %xmm0, %xmm0 cmpneqps %xmm0, %xmm0 andps .LC1(%rip), %xmm0 call printv xorps %xmm1, %xmm1 movaps (%rsp), %xmm0 cmpneqps %xmm1, %xmm0 movaps (%rsp), %xmm1 andps .LC0(%rip), %xmm1 orps .LC1(%rip), %xmm1 andps %xmm1, %xmm0 call printv .LC0: .long 2147483648 .long 2147483648 .long 2147483648 .long 2147483648 .LC1: .long 1065353216 .long 1065353216 .long 1065353216 .long 1065353216 .LC2: .long 1065353216 .long 3221225472 .long 1077936128 .long 3229614080 .LC3: .long 1065353216 .long 1073741824 .long 1077936128 .long 1082130432 .LC4: .long 3212836864 .long 3221225472 .long 3225419776 .long 3229614080 ICC: movaps _2il0floatpacket.8, %xmm0 addl $4, %esp cmpeqps %xmm0, %xmm0 call printv xorps %xmm0, %xmm0 cmpltps _2il0floatpacket.8, %xmm0 call printv movaps _2il0floatpacket.8, %xmm0 xorps %xmm1, %xmm1 cmpltps %xmm1, %xmm0 call printv movaps _2il0floatpacket.9, %xmm0 movaps _2il0floatpacket.9, %xmm2 andps _2il0floatpacket.10, %xmm0 xorps %xmm1, %xmm1 cmpneqps %xmm1, %xmm2 orps _2il0floatpacket.11, %xmm0 andps %xmm2, %xmm0 call printv movaps _2il0floatpacket.12, %xmm1 movaps _2il0floatpacket.10, %xmm0 andps %xmm1, %xmm0 orps _2il0floatpacket.11, %xmm0 xorps %xmm2, %xmm2 cmpneqps %xmm2, %xmm1 andps %xmm1, %xmm0 call printv xorps %xmm0, %xmm0 cmpneqps %xmm0, %xmm0 call printv movaps _2il0floatpacket.10, %xmm0 movaps _2il0floatpacket.8, %xmm1 andps %xmm1, %xmm0 orps _2il0floatpacket.11, %xmm0 xorps %xmm2, %xmm2 cmpneqps %xmm2, %xmm1 andps %xmm1, %xmm0 call printv _2il0floatpacket.8: .long 0x3f800000,0xc0000000,0x40400000,0xc0800000 _2il0floatpacket.9: .long 0x3f800000,0x40000000,0x40400000,0x40800000 _2il0floatpacket.10: .long 
0x80000000,0x80000000,0x80000000,0x80000000 _2il0floatpacket.11: .long 0x3f800000,0x3f800000,0x3f800000,0x3f800000 _2il0floatpacket.12: .long 0xbf800000,0xc0000000,0xc0400000,0xc0800000 _2il0floatpacket.14: .long 0x80000000,0x80000000,0x80000000,0x80000000 _2il0floatpacket.15: .long 0x3f800000,0x3f800000,0x3f800000,0x3f800000
None of the compilers optimized the comparisons away, which could benefit the code to a large extent, especially when inlined. It’s worth mentioning that GCC merged some of the constants, eliminating 2 of the vectors that ICC left in. ICC and GCC both optimized out useless ORs where possible, while MSVC simply followed the code intrinsic by intrinsic.
I keep hearing the catch-phrase among programmers that “the compiler is better than you [think].” I completely disagree with it and object to its use. Not only does it lead novice programmers to give the compiler a lot of credit in cases where it’s impossible to expect a compiler to optimize, it also makes more advanced programmers lazy, believing the compiler knows what it’s doing.
Shown here is a case of using so-called ‘intrinsics’ to guide the compiler as opposed to instructing it. As the above examples show, only GCC (and to an extent, ICC) behaves the way we expect, though it still misses a few cases (such as merging shuffles and predicting vector branches). MSVC is most likely the worst example of an SSE-guided compiler – not only did it fail to optimize any of the tests, it also generated horrible assignment code which abused the stack most of the time and hurt performance by not utilizing the cache properly.
If you code using SSE intrinsics, I advise you to take a closer look at the generated code if you want maximum performance. Taking advantage of SSE for speed will bring a lot of satisfaction if done properly – instruction pairing, redundant arithmetic operations and redundant compares should be handled by human beings most of the time, and you should not rely on the compiler to do that. Compilers are given much more credit than they deserve.
As a side note about GCC’s near perfection in code generation – I was quite surprised to see it surpass even Intel’s own compiler! It shows that even compiler writers who know their own hardware and its internal mechanisms can overlook simple problems the way humans do, redundancy in most cases. I highly recommend giving the newest GCC 4.4 a try. If you are on Linux, you most likely have GCC 4.3.x, or, if your distribution is an early bird (Gentoo, Fedora…), you might already have 4.4. Windows users are lucky: GCC 4.4 has been ported successfully to Windows in both the MinGW suite and the TDM suite. Mac users might have to compile gcc 4.4 themselves using Xcode (which is actually gcc 4.0.1).
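For reference when reading the listings, here is what one lane of sign() computes, written as a scalar model. It is a sketch under the assumption of 32-bit IEEE-754 floats, and the helper names (toBits, fromBits, signLane) are made up for this example:

```cpp
#include <cstdint>
#include <cstring>

static std::uint32_t toBits(float value)
{
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}

static float fromBits(std::uint32_t bits)
{
    float value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}

// One lane of: and(or(and(m, -0.0f), 1.0f), cmpneq(m, 0.0f))
float signLane(float m)
{
    const std::uint32_t signBit = toBits(-0.0f);  // 0x80000000
    const std::uint32_t one     = toBits(1.0f);   // 0x3f800000
    // cmpneqps yields all ones where the lanes differ and all zeros otherwise
    const std::uint32_t nonZero = (m != 0.0f) ? 0xFFFFFFFFu : 0u;
    // Copy the sign of m onto 1.0f, then mask the lane to zero when m == 0
    return fromBits(((toBits(m) & signBit) | one) & nonZero);
}
```

signLane(3.0f) is 1, signLane(-4.0f) is -1 and signLane(0.0f) is 0. With constant input, every step here is known at compile time, which is why the comparison could have been folded away.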
Happy optimizing!
One old remedy for this was supposedly mod_evasive, but it doesn’t really work against that specific type of attack, as it acts too late to recognize it as an attack.
Very recently, an Apache module fixing this vulnerability was released, mod_antiloris, but it’s made with a RedHat-based server in mind. Here are the steps to get it working on Debian or any other Debian-compatible system (such as Ubuntu).
First install the prerequisites. I assume you are using the threaded version of Apache, else you are not vulnerable to this type of attack.
sudo apt-get install gcc apache2-threaded-dev
Next, get the module source, extract it and compile:
wget "ftp://ftp.monshouwer.eu/pub/linux/mod_antiloris/mod_antiloris-0.3.tar.bz2"
tar xvf mod_antiloris-0.3.tar.bz2
cd mod_antiloris-0.3/
The following command will end in an error – this is perfectly normal! Since apxs2 (the Apache extension tool) on Debian isn’t modified to handle Debian-style modules, do not run it as root, as it will mess up your system, thinking it’s RedHat compatible.
apxs2 -a -i -c mod_antiloris.c
Because apxs2 didn’t have permission to copy the module, we’ll do it ourselves:
sudo cp .libs/mod_antiloris.so /usr/lib/apache2/modules/mod_antiloris.so
Now we’ll add Debian-style .load file to auto load the module:
sudo su -c "echo 'LoadModule antiloris_module /usr/lib/apache2/modules/mod_antiloris.so' > /etc/apache2/mods-available/antiloris.load"
Then we’ll enable the module, Debian style:
sudo a2enmod antiloris
And reload Apache’s configuration and modules:
sudo /etc/init.d/apache2 reload
This module solves the slowloris DoS attack – so I urge you to install it as soon as possible if you are using Apache as your HTTP server.
I would like to make sure credit goes where it is due. I did not develop this module; I just wrote instructions on how to make it Debian compatible, since it seems to be RedHat centric. The module was written and is hosted by Kees Monshouwer, for whom I cannot seem to find any official website.
I hope this will help people as much as it helped me.
Check Sintia’s page here.
I hope someone will find this useful.
First, I would like to start with the fact that this demo has been lying around on my HDD since Halloween.
This demo was written to demonstrate how easy it is to utilize the DS’s hardware blending and create impressive effects with no effort. In this demo, the witch is flying in the sky, and whenever she passes in front of the moon, she turns black, because the light from the moon makes her appear as a silhouette.
The demo is composed of 2 backgrounds and a sprite. The sprite is set to blend with the first background (the moon) with 0 blending, resulting in the black color.
Download: blending-demo.
Ever since the latest Rhythmbox release, there has been an undocumented feature in rhythmbox-client to print the string received from shoutcast streams, such as my favorite di.fm radio, which I normally have on. I found several xchat-rhythmbox announcers, but they all lacked the ability to determine whether rhythmbox is currently streaming music or playing a music file.
Now that I actually have free time, I could write a small script to do exactly what I wanted, and I’ve decided to share it. The source/script is released under the terms of the WTFPL.
Download link: rhythmbox_nowplaying.tar.gz.
I felt the need to take EventDispatcher outside of my flash projects to my more advanced C++ ones. This turned out to be quite an easy task.
Instead of a boring ‘download code’ link, I will write the steps of implementing it using C++’s STL, just because I feel like writing.
I would like to start by noting that I will only implement the EventDispatcher class, and not flash.events.Event. I will have a very basic and dull Event class. Another note would be the Object class, which isn’t reimplemented for performance (RTTI hacks are expensive).
Let’s start by examining our situation:
In this article, I will tackle those one by one.
Let’s start with the basics. We know EventDispatcher::addEventListener takes functions as callbacks, so let’s define an event listener type. Before we can do that, though, we also need to define an Event class to be passed when the event callback is called.
#include <string>

// Event class
class Event {
public:
    Event(const std::string &type, bool bubbles = false, bool cancelable = false):
        type(type), bubbles(bubbles), cancelable(cancelable) { }

    const std::string type;
    const bool bubbles;
    const bool cancelable;
    /*
    const void* target;
    const unsigned int eventPhase;
    const void* currentTarget;
    */
};

// Event function callback pointer type
typedef void (*eventFunctionPtr)(const Event &);
The Event class does nothing important beyond emulating Flash’s Event class and holding a const std::string for the type of the event.
The callback type, called eventFunctionPtr, points to a function that returns nothing (void) and takes a const reference to an Event as an argument, so the following AS3 code:
public function eventListener(event:Event):void {
    // Listener code ...
}
Would become this code in C++:
void eventListener(const Event &event) {
    // Listener code ...
}
Now that we have our Event class type and the function callback type, let’s implement the basics – mapping string to function callbacks. This can easily be done using std::map which stores data (in this case, a function pointer) that can later be retrieved by using a key (in this case, a string). This is a bit like using a primary key to refer to a record in a database table.
The code at the moment is quite simple, having only addEventListener coded:
class EventDispatcher {
public:
    void addEventListener(const std::string &type, eventFunctionPtr listener, bool useCapture = false, int priority = 0, bool useWeakReference = false) {
        // Set the event listener to the key
        eventHandlerList[type] = listener;
    }

private:
    std::map<const std::string, eventFunctionPtr > eventHandlerList;
};
Implementing the method hasEventListener is also effortless, since we are just checking to see if a key exists on the map:
bool hasEventListener(const std::string &type) {
    return (eventHandlerList.find(type) != eventHandlerList.end());
}
Now for the heart of the EventDispatcher class – the dispatchEvent method, which will simply execute the function we get from the key:
void dispatchEvent(const Event &event) {
    if(hasEventListener(event.type))
        eventHandlerList[event.type](event);
}
We check that a listener exists for this event because std::map’s operator[] inserts a default value (here, a null function pointer) for any key that is not already in the map. Without the check, dispatching an event type that is not registered with our EventDispatcher would call that null pointer, waste memory on the inserted entry, and make hasEventListener report a false positive afterwards. This simple check makes sure we only execute the event listener if a callback has in fact been registered.
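The difference between the two lookups is easy to see in isolation. The sketch below is not part of the class; the callbackPtr type and the function names are made up for the illustration.

```cpp
#include <cstddef>
#include <map>
#include <string>

typedef void (*callbackPtr)(int);   // hypothetical callback type for this demo

// Looks up a missing key with find() and returns the resulting map size.
std::size_t sizeAfterFind() {
    std::map<std::string, callbackPtr> handlers;
    handlers.find("missing");               // read-only lookup, inserts nothing
    return handlers.size();
}

// Looks up a missing key with operator[] and returns the resulting map size.
std::size_t sizeAfterIndex() {
    std::map<std::string, callbackPtr> handlers;
    callbackPtr p = handlers["missing"];    // inserts a null pointer under "missing"
    (void)p;                                // p is null; calling it would be a bug
    return handlers.size();
}
```

find() leaves the map empty, while operator[] leaves one entry behind, which is exactly the entry that would later fool hasEventListener.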
When we try to implement the removeEventListener method, we come across a problem: we only have one callback for each event string. removeEventListener takes a string and a function pointer, but with a single callback per event the pointer goes unused. If one callback per event is all you need, the following code will do the trick:
void removeEventListener(const std::string &type, eventFunctionPtr listener, bool useCapture = false) {
    eventHandlerList.erase(type);
}
Since that’s not what we desire, we will move on to the next item on the list, which specifies that each event can have several listeners. We can easily do that by mapping a string to a list of function pointers using std::list.
Let’s start by redefining eventHandlerList to its new type:
std::map<const std::string, std::list<eventFunctionPtr > > eventHandlerList;
However, that requires us to change addEventListener, removeEventListener and dispatchEvent, although hasEventListener will remain unchanged.
addEventListener has the easiest ‘fix’: instead of assigning the listener to the map’s key, we now add it to the list stored under that key. This changes addEventListener to the following code:
void addEventListener(const std::string &type, eventFunctionPtr listener, bool useCapture = false, int priority = 0, bool useWeakReference = false) {
    // Simply add the event listener to the list of listeners
    eventHandlerList[type].push_back(listener);
}
In removeEventListener’s case, we remove all occurrences of the listener from the list, first checking that the map has this key registered:
void removeEventListener(const std::string &type, eventFunctionPtr listener, bool useCapture = false) {
    if(hasEventListener(type))
        eventHandlerList[type].remove(listener);
}
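As a quick check of the “all occurrences” part, here is a small standalone sketch of std::list::remove on function pointers. The listenerPtr type and the onFirst/onSecond functions are simplified stand-ins, not the names used in the class.

```cpp
#include <cstddef>
#include <list>

typedef void (*listenerPtr)();      // hypothetical, simplified listener type

static int firstCalls = 0;
static int secondCalls = 0;
void onFirst()  { ++firstCalls; }
void onSecond() { ++secondCalls; }

// Registers onFirst twice and onSecond once, removes onFirst,
// and returns how many listeners are left in the list.
std::size_t remainingAfterRemove() {
    std::list<listenerPtr> listeners;
    listeners.push_back(&onFirst);
    listeners.push_back(&onSecond);
    listeners.push_back(&onFirst);
    listeners.remove(&onFirst);     // drops both copies of onFirst
    return listeners.size();
}
```

Only onSecond survives, so a listener that was accidentally added twice is fully gone after one removeEventListener call.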
The dispatchEvent method, however, gets a complete makeover, because this time we have to iterate over the list of callbacks and execute them all:
void dispatchEvent(const Event &event) {
    // Leave if no event registered
    if(!hasEventListener(event.type))
        return;

    // A reference to keep code clean
    std::list<eventFunctionPtr > &allFunctions = eventHandlerList[event.type];

    // Iterate through all functions in the event and execute them
    for(std::list<eventFunctionPtr >::iterator i=allFunctions.begin(); i!=allFunctions.end(); ++i)
        (*i)(event);
}
The function hasEventListener does not need to change, and still works with our new structure, since the base data structure is still std::map.
Now that we are done with that, we face a new problem: priorities. Flash lets you give each listener a priority, so that higher-priority listeners run earlier when the event is dispatched. However, Flash is so flexible that it lets us set the priority as a signed 32-bit integer. Imagine this: an array of 4,294,967,296 lists (2 to the power of 32, roughly 4 billion) residing in memory for each event we have. This is huge!
A neat solution is to use another map, from integers to lists, this time for the sole purpose of saving memory, not speed.
So this time our eventHandlerList evolves into this scary-looking type:
std::map<const std::string, std::map<int, std::list<eventFunctionPtr > > > eventHandlerList;
In case you are lost, here is a quick description of what’s going on: this structure maps a string (event type) to another map, which maps a 32-bit integer to a list of function pointers.
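To illustrate the memory argument, the sketch below registers listeners at the two most extreme priorities and counts how many lists the inner map really holds. The listenerPtr type and the function names are hypothetical stand-ins for the ones in the class.

```cpp
#include <climits>
#include <cstddef>
#include <list>
#include <map>

typedef void (*listenerPtr)();      // hypothetical, simplified listener type

void onEvent() {}

// Registers listeners at the lowest and highest possible priorities
// and returns how many priority lists the map actually stores.
std::size_t bucketsInUse() {
    std::map<int, std::list<listenerPtr> > byPriority;
    byPriority[INT_MAX].push_back(&onEvent);
    byPriority[INT_MIN].push_back(&onEvent);
    byPriority[INT_MAX].push_back(&onEvent);   // reuses the existing list
    return byPriority.size();
}
```

Only the priorities that are in use get a list, so the whole integer range is available without paying for the billions of values in between.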
Again, we will start by modifying the simplest method, which is addEventListener. The change this time simply maps the priority in addition to the type:
void addEventListener(const std::string &type, eventFunctionPtr listener, bool useCapture = false, int priority = 0, bool useWeakReference = false) {
    // Simply add the event listener to the list of listeners for the selected priority
    eventHandlerList[type][priority].push_back(listener);
}
Dispatching an event now gets a little more interesting, since we have to iterate over two structures – first the map of priorities, then the list of callbacks. We will be iterating the map in reverse, since we want higher priority functions (higher keys) to be executed first as opposed to the way std::map iterates which is from lowest to highest (negative to positive).
void dispatchEvent(const Event &event) {
    // Leave if no event registered
    if(!hasEventListener(event.type))
        return;

    // A reference to keep code clean
    std::map<int, std::list<eventFunctionPtr > > &allFunctions = eventHandlerList[event.type];

    // Iterate through all functions in the event, from high priority to low
    for(std::map<int, std::list<eventFunctionPtr > >::reverse_iterator i=allFunctions.rbegin(); i!=allFunctions.rend(); ++i) {
        std::list<eventFunctionPtr > &funcList = i->second;

        // Execute callbacks
        for(std::list<eventFunctionPtr >::iterator f=funcList.begin(); f!=funcList.end(); ++f)
            (*f)(event);
    }
}
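The ordering can be checked on its own with a tiny map. This is only a sketch of the iteration order; the function name and the string values are made up for the example.

```cpp
#include <map>
#include <string>
#include <vector>

// Returns the keys of a small priority map in the order
// a reverse_iterator visits them.
std::vector<int> prioritiesHighToLow() {
    std::map<int, std::string> byPriority;
    byPriority[0] = "normal";
    byPriority[-5] = "low";
    byPriority[10] = "high";

    std::vector<int> order;
    for (std::map<int, std::string>::reverse_iterator i = byPriority.rbegin();
         i != byPriority.rend(); ++i)
        order.push_back(i->first);
    return order;                   // 10, 0, -5
}
```

Whatever order the priorities were inserted in, the reverse walk always goes from the highest key to the lowest.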
The real tricky method is removeEventListener – because we only receive the event type and the listener to remove, we have no information on what priority the callback has and where it’s located inside our structure, which means we will have to search for it.
Another problem is false positives: if we remove an event listener from the list, an empty list will stay in memory, meaning that hasEventListener will still return true even though no listeners remain. A way to overcome this problem is to erase the list from the priority map whenever it’s empty AND remove the priority map from the event type map once it is empty too.
The following code will do both:
void removeEventListener(const std::string &type, eventFunctionPtr listener, bool useCapture = false) {
    // Leave if no event registered
    if(!hasEventListener(type))
        return;

    // Reference to keep the code clean
    std::map<int, std::list<eventFunctionPtr > > &allFunctions = eventHandlerList[type];

    // Since we don't know the function's priority, we'll search for it.
    // The iterator is advanced inside the loop body, because erasing an
    // element invalidates the iterator that points at it.
    for(std::map<int, std::list<eventFunctionPtr > >::iterator i=allFunctions.begin(); i!=allFunctions.end(); ) {
        // Saving a branch here: instead of checking if the callback exists let remove() do it for us
        i->second.remove(listener);

        // Remove object from the map if list gone empty to eliminate false positives
        if(i->second.empty())
            allFunctions.erase(i++);
        else
            ++i;
    }

    // Remove map to eliminate false positives
    if(allFunctions.empty())
        eventHandlerList.erase(type);
}
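Putting the pieces together, here is one way the whole thing could look as a single compilable unit. It is a trimmed sketch rather than the final word: the unused useCapture and useWeakReference parameters and the bubbles/cancelable fields are dropped, find() is used instead of hasEventListener followed by operator[], and the typedef names, callLog, onLow, onHigh and runDemo are all made up for the example.

```cpp
#include <list>
#include <map>
#include <string>
#include <vector>

class Event {
public:
    explicit Event(const std::string &type) : type(type) {}
    const std::string type;
};

typedef void (*eventFunctionPtr)(const Event &);

class EventDispatcher {
public:
    void addEventListener(const std::string &type, eventFunctionPtr listener, int priority = 0) {
        eventHandlerList[type][priority].push_back(listener);
    }

    bool hasEventListener(const std::string &type) const {
        return eventHandlerList.find(type) != eventHandlerList.end();
    }

    void dispatchEvent(const Event &event) {
        HandlerMap::iterator found = eventHandlerList.find(event.type);
        if (found == eventHandlerList.end())
            return;
        PriorityMap &allFunctions = found->second;
        // Highest priority first
        for (PriorityMap::reverse_iterator i = allFunctions.rbegin(); i != allFunctions.rend(); ++i)
            for (ListenerList::iterator f = i->second.begin(); f != i->second.end(); ++f)
                (*f)(event);
    }

    void removeEventListener(const std::string &type, eventFunctionPtr listener) {
        HandlerMap::iterator found = eventHandlerList.find(type);
        if (found == eventHandlerList.end())
            return;
        PriorityMap &allFunctions = found->second;
        for (PriorityMap::iterator i = allFunctions.begin(); i != allFunctions.end(); ) {
            i->second.remove(listener);
            if (i->second.empty())
                allFunctions.erase(i++);    // step forward before the node goes away
            else
                ++i;
        }
        if (allFunctions.empty())
            eventHandlerList.erase(found);
    }

private:
    typedef std::list<eventFunctionPtr> ListenerList;
    typedef std::map<int, ListenerList> PriorityMap;
    typedef std::map<std::string, PriorityMap> HandlerMap;
    HandlerMap eventHandlerList;
};

// Demo listeners that record the order they were called in
static std::vector<std::string> callLog;
void onLow(const Event &)  { callLog.push_back("low"); }
void onHigh(const Event &) { callLog.push_back("high"); }

std::vector<std::string> runDemo() {
    callLog.clear();
    EventDispatcher dispatcher;
    dispatcher.addEventListener("click", &onLow, -1);
    dispatcher.addEventListener("click", &onHigh, 5);
    dispatcher.dispatchEvent(Event("click"));
    dispatcher.removeEventListener("click", &onHigh);
    dispatcher.removeEventListener("click", &onLow);
    callLog.push_back(dispatcher.hasEventListener("click") ? "still registered" : "empty");
    return callLog;
}
```

runDemo registers the low-priority listener first, yet the log shows the high-priority one running first, and once both listeners are removed hasEventListener no longer reports the event type.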
I could go on, but from here on the class really starts integrating into the GUI part of the code (bubbling, phases…), which would require GUI code.
I hope this helps someone. I just wanted to write something on my aging site.