Agreed. I’ve been writing code with liberal use of SSE intrinsics lately and it’s been quite difficult:
*need to painstakingly massage and permute the data into vector registers with the _mm_unpack* and _mm_shuffle* intrinsics
*specialization for different data types:
example: SSE2 has no uint8 compare-less-equal, so I had to do a monotonic conversion to int8 by adding -128 and using the signed 8-bit compare:
aOffset = a - 128
bOffset = b - 128
return aOffset < bOffset || aOffset == bOffset
Luckily, I later discovered that a <= b is equivalent to min(a, b) == a, which is more efficient than the method above (SSE2 does have an unsigned byte min).
This is appalling. Why hasn't a decent SSE-capable compiler been released? I've tried GCC's -ftree-vectorize and all it seems able to do is SIMDize adding two arrays, a memset, or other trivial operations. It can't do things like compacting elements, which I do painstakingly with _mm_shuffle. I hear the Intel C++ compiler does better auto-vectorization, but I haven't tried it. I wouldn't be surprised if it's not decent either - Intel's idea of optimization is to have an army of people doing manual tuning (hand-optimized datapaths and the IPP libraries) instead of building better tools.
With this deficiency, I can't see a good future for SSE. Currently, GPUs seem to have about a 4x raw GFLOP advantage over CPUs at the same transistor budget. Since SSE is so clumsy to use, another 4x speedup from moving to the GPU is likely in practice. This has to be why CUDA doesn't expose operations on vector registers and uses SIMT instead, though SIMT is something of a mirage performance-wise (it's still better to have all threads run the same instruction).
On the plus side, when I use SSE, I know the code is about as optimized as it can be for a single thread. With CUDA, there always seem to be more optimization opportunities left (maybe that's just my inexperience).