Sometimes you find hidden nuggets that aren’t in change lists…
It looks like CUDA 3.2 snuck in a simple new intrinsic, a “swizzle” operator. I never noticed since it’s just a short entry in the programming guide.
This lets you reorder or duplicate bytes selected from two different 4-byte words. SSE intrinsics on CPUs have similar byte shuffles.
You could of course do this with shifts and masks, but it looks like this is a built-in op!
I’m especially happy that this is here since I’ve had to do such reordering. It’s usually not a big efficiency problem, but it’s just nice to replace 4 lines of code filled with shifts and masks with a single line.
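To make the comparison concrete, here is a minimal sketch, assuming the intrinsic in question is `__byte_perm(x, y, s)`. As I understand the selector, each of the low four hex digits of `s` picks one of the eight input bytes (0-3 from `x`, 4-7 from `y`) for the corresponding result byte, lowest byte first. The function names, the kernel, and the host wrapper `run_swizzle_demo` are made up for illustration.

```cuda
#include <cuda_runtime.h>

// Byte reversal the old way: four shifts and two masks.
__host__ __device__ unsigned int bswap_shifts(unsigned int x)
{
    return (x >> 24)
         | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u)
         | (x << 24);
}

// Same reversal as a single swizzle.
// Selector 0x0123: result byte 0 <- input byte 3, byte 1 <- byte 2,
// byte 2 <- byte 1, byte 3 <- byte 0. The second word is unused here.
__device__ unsigned int bswap_perm(unsigned int x)
{
    return __byte_perm(x, 0u, 0x0123);
}

// Mixing two words: low half of a in the low 16 bits, low half of b
// in the high 16 bits. Bytes of b are addressed as indices 4..7.
__device__ unsigned int pack_low_halves(unsigned int a, unsigned int b)
{
    return __byte_perm(a, b, 0x5410);
}

__global__ void swizzle_demo(unsigned int a, unsigned int b, unsigned int *out)
{
    out[0] = bswap_shifts(a);
    out[1] = bswap_perm(a);
    out[2] = pack_low_halves(a, b);
}

// Hypothetical host wrapper: runs the kernel on one thread and copies
// the three results back. Returns the last CUDA error status.
cudaError_t run_swizzle_demo(unsigned int a, unsigned int b, unsigned int out[3])
{
    unsigned int *d_out = 0;
    cudaError_t err = cudaMalloc((void **)&d_out, 3 * sizeof(unsigned int));
    if (err != cudaSuccess) return err;
    swizzle_demo<<<1, 1>>>(a, b, d_out);
    err = cudaMemcpy(out, d_out, 3 * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err;
}
```

With `a = 0x11223344` and `b = 0xAABBCCDD`, the first two outputs should agree (`0x44332211`) and the third should be `0xCCDD3344`. PTX does have a byte-permute instruction (`prmt`) that this plausibly maps to on newer hardware, but dumping the PTX is the only way to be sure for a given target.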
I haven’t checked the PTX… I’m not sure if this reduces to a single-op intrinsic or not. I suspect it does, since swizzling like this is common in Cg and shaders.
While talking about swizzling, I wonder if there’s an efficient way to swizzle out access to the high and low words of a 64-bit integer? It should be a zero-cost conversion, sort of like __float_as_int. I do this kind of low-level data update in my PRNG code. Eventually I should rewrite it all in PTX, but it’d be nice if the CUDA code were sufficient.
For example, you may have:

    unsigned long long x64 = something();
    unsigned int loword = (unsigned int) x64;         // truncation keeps the low 32 bits
    unsigned int hiword = (unsigned int) (x64 >> 32); // shift the high 32 bits down first

The inefficiency is that the bit shift isn’t free, even though its only purpose is to reach the high word.
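One possible way to get at the two halves without a shift is to go through the existing reinterpretation intrinsics: `__longlong_as_double` to relabel the 64 bits as a double, then `__double2loint` and `__double2hiint` to pull out each 32-bit half. This is only a sketch under that assumption, and the helper names and the `run_split_demo` wrapper are made up. It may also turn out that the compiler already reduces the `>> 32` to a plain register pick, so comparing the generated PTX for both versions is worthwhile before changing any code.

```cuda
#include <cuda_runtime.h>

// Portable version: truncate for the low word, shift for the high word.
__host__ __device__ unsigned int lo32_shift(unsigned long long x)
{
    return (unsigned int)x;
}

__host__ __device__ unsigned int hi32_shift(unsigned long long x)
{
    return (unsigned int)(x >> 32);
}

// Reinterpretation version: relabel the 64 bits as a double (no numeric
// conversion), then extract each 32-bit half with the double intrinsics.
__device__ unsigned int lo32_cast(unsigned long long x)
{
    return (unsigned int)__double2loint(__longlong_as_double((long long)x));
}

__device__ unsigned int hi32_cast(unsigned long long x)
{
    return (unsigned int)__double2hiint(__longlong_as_double((long long)x));
}

__global__ void split_demo(unsigned long long x, unsigned int *out)
{
    out[0] = lo32_shift(x);
    out[1] = hi32_shift(x);
    out[2] = lo32_cast(x);
    out[3] = hi32_cast(x);
}

// Hypothetical host wrapper: out[0..1] hold the shift-based halves,
// out[2..3] the reinterpretation-based halves, for comparison.
cudaError_t run_split_demo(unsigned long long x, unsigned int out[4])
{
    unsigned int *d_out = 0;
    cudaError_t err = cudaMalloc((void **)&d_out, 4 * sizeof(unsigned int));
    if (err != cudaSuccess) return err;
    split_demo<<<1, 1>>>(x, d_out);
    err = cudaMemcpy(out, d_out, 4 * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err;
}
```

If both pairs of outputs match for a range of inputs, the next step would be to look at the PTX for each variant and see whether the shift actually survives in the first one.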