Toolkit 3.2 goodies

Sometimes you find hidden nuggets that aren’t in change lists…

It looks like 3.2 snuck in a simple new intrinsic, a “swizzle” operator called __byte_perm(). I never noticed since it’s just a short entry in the programming guide.


This lets you reorder or duplicate bytes sampled from two different 4-byte words. SSE intrinsics on CPUs have similar swizzlers.

You could of course do this with shifts and masks but it looks like this is a builtin op!

I’m especially happy that this is here since I’ve had to do such reordering. It’s usually not a big efficiency problem, but it’s just nice to replace 4 lines of code filled with shifts and masks with a single line.
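For anyone who wants to see the shifts-and-masks version next to the intrinsic, here is a minimal sketch. It assumes the selector semantics described in the programming guide: the low three bits of each selector nibble pick one of the eight input bytes, with bytes 0-3 coming from the first word and 4-7 from the second. The names HOST_DEVICE, pick_byte, permute_bytes and byte_swap are made up for illustration. The device path calls __byte_perm(); the host path is the manual equivalent, so the same file also builds with a plain C++ compiler.

```cpp
#include <cstdint>

// Expands to CUDA qualifiers under nvcc and to nothing under a plain C++ compiler.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Pick byte idx (0-7) from the 8-byte pool {y:x}: bytes 0-3 are x, bytes 4-7 are y.
HOST_DEVICE inline uint32_t pick_byte(uint32_t x, uint32_t y, uint32_t idx) {
    uint64_t pool = (static_cast<uint64_t>(y) << 32) | x;
    return static_cast<uint32_t>(pool >> (8 * (idx & 7))) & 0xFFu;
}

// Hypothetical wrapper: the intrinsic in device code, shifts and masks on the host.
HOST_DEVICE inline uint32_t permute_bytes(uint32_t x, uint32_t y, uint32_t s) {
#ifdef __CUDA_ARCH__
    return __byte_perm(x, y, s);
#else
    return  pick_byte(x, y, s)                 // result byte 0 <- selector nibble 0
         | (pick_byte(x, y, s >> 4)  << 8)     // result byte 1 <- selector nibble 1
         | (pick_byte(x, y, s >> 8)  << 16)    // result byte 2 <- selector nibble 2
         | (pick_byte(x, y, s >> 12) << 24);   // result byte 3 <- selector nibble 3
#endif
}

// Example use: reverse the byte order of one word (an endian swap).
HOST_DEVICE inline uint32_t byte_swap(uint32_t x) {
    return permute_bytes(x, 0u, 0x0123u);
}
```

With this sketch, selector 0x3210 returns the first word unchanged, 0x7654 returns the second word, and 0x5410 packs the low halves of both words together.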

I haven’t checked the PTX… I’m not sure if this reduces to a single-op intrinsic or not. I suspect it does, since swizzling like this is common in Cg and shaders.

While talking about swizzling, I wonder if there’s an efficient way to swizzle out access to the high and low words of a 64-bit integer? It should be a zero-cost conversion, sort of like __float_as_int. I do this kind of low-level data update in my PRNG code. Eventually I should rewrite it all in PTX, but it’d be nice if the CUDA code were sufficient.

For example you may have

unsigned long long x64 = something(); // plain unsigned long is only 32 bits on some platforms (e.g. Windows)

unsigned int loword = (unsigned int) x64; // truncates to locate low 32 bits

unsigned int hiword = (unsigned int) (x64>>32);

The efficiency loss is that a bit shift isn’t free, even though the shift is just to get access to the high word.

As I recall, this was introduced in CUDA 3.1 (which would explain why it isn’t mentioned in the CUDA 3.2 release notes :-). You may be interested in the following forum thread:

This is cool. Thanks!
But I hate swizzlers… That’s one reason why SSE assembly code is literally unreadable.

There’s always more to learn! Thanks.

Answering part of my own question, I don’t think there’s a CUDA way other than the shift version. That gets translated into a 64 bit shr.u64 shift (expensive) plus a cvt.u32.u64 in PTX.

In raw PTX, you can use mov.b64 {lo,hi}, %x;.
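If your toolkit supports inline PTX in CUDA C, the same unpacking move can be written without leaving the .cu file. This is a rough sketch under that assumption, not something I have checked against the generated cubin. The names HOST_DEVICE and split_u64 are invented for the example. The host path falls back to the truncate-and-shift form so the code also builds with a plain C++ compiler.

```cpp
#include <cstdint>

// Expands to CUDA qualifiers under nvcc and to nothing under a plain C++ compiler.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical helper: split a 64-bit value into its low and high 32-bit words.
HOST_DEVICE inline void split_u64(uint64_t x, uint32_t &lo, uint32_t &hi) {
#ifdef __CUDA_ARCH__
    // Device path: unpack one 64-bit register into two 32-bit registers.
    asm("mov.b64 {%0, %1}, %2;" : "=r"(lo), "=r"(hi) : "l"(x));
#else
    // Host path: the portable truncate-and-shift form.
    lo = static_cast<uint32_t>(x);
    hi = static_cast<uint32_t>(x >> 32);
#endif
}
```

Calling split_u64(x64, loword, hiword) then replaces the two cast lines from the original post. Whether it actually beats the shift version depends on what the backend does with x>>32, so it is worth timing both.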

I haven’t checked the disassembled cubins, but I suspect the 64-bit shift gets translated into 4 or 5 ops on two words.

Then you’ll really love the Larrabee instruction set! It takes that idea of SSE assembly and extends it with two more layers of complexity and obfuscation. Powerful, but it’s just impractical and difficult to use. And I’m saying this as an admitted micro-optimization geek who actually enjoyed Altivec and Cell SPU hacking.

A slightly hacky way that should achieve what you want [restricted to platforms >= sm_13 for obvious reasons] is:

long long int x64;
int hi = __double2hiint(__longlong_as_double(x64));
int lo = __double2loint(__longlong_as_double(x64));

The mov.b64 instructions generated by this should be optimized out by PTXAS. As a registered developer you can use cuobjdump to verify that that is the case. The compiler backend might even do a credible job of optimizing your current shift-based version but I have not actually looked into that.

To answer a question from your original post, __byte_perm() enjoys hardware support on sm_2x platforms, but is emulated in software for sm_1x platforms, which means it’s quite slow for the latter.

Clever idea! Some quick timing tests show that the shift version is much faster than the double2hiint hack. I also tried with a shift of 7 and a shift of 32, and the 32 shift was much much faster, so I bet the compiler does indeed optimize the 64 bit x>>32 shift case.

Hello someone out there!

When I try to install CUDA Toolkit 3.2, an error pops out saying “Graphics Hardware doesn’t support it… or something”.
I have bought a brand new Dell laptop with 1 GB NVIDIA GeForce GT 420M card in it… which does support CUDA according to CUDA Programming Guide…
I am a CUDA beginner. Can someone help me out with it???

And can anybody explain to me how to get into CUDA emulation mode, i.e., to have CUDA code and stuff executed without actually having the graphics card…??