__shfl() for 64 double , Nvidia's example only for gcc?

Nvidia used this example __shfl() implementation for 64 bit doubles:

__device__ __inline__ double shfl(double x, int lane)
// Split the double number into 2 32b registers.
int lo, hi;
asm volatile( “mov.b32 {%0,%1}, %2;” : “=r”(lo), “=r”(hi) : “d”(x));
// Shuffle the two 32b registers.
lo = __shfl(lo, lane);
hi = __shfl(hi, lane);
// Recreate the 64b number.
asm volatile( “mov.b64 %0, {%1,%2};” : “=d(x)” : “r”(lo), “r”(hi));
return x;

But Visual Studio compiler is not accepting those assembly language statements. Specifically it does not like the second asm statement:

asm volatile( “mov.b64 %0, {%1,%2};” : “=d(x)” : “r”(lo), “r”(hi));

and says

'error : expected a “(” ’

It accepts the first asm volatile statement and the two __shfl()s, but that second asm volatile.

So either this code only works with gcc compiler, or I make some copy-paste translation error.

I put in standard " " quotes in place of those directional ones used in the example.

Is there another example implementation which would accomplish the same 64 bit shuffle result, or is this one correct for Windows operating system?

Is the above a one-to-one copy? It seems to me the mov.b32 should be a mov.b64. Also, there is a syntax error, it should be




This was the problem, thanks.

That implementation was from this Nvidia presentation:


page 6.

They must of made a syntax error, and since I am not very familiar asm I could not identify.

It has been my experience that the “pack/unpack” opcode is entirely optimized away in SASS. That is, pack/unpack represent the (only?) way to indicate to the compiler that there is a vector<>scalar transition.

Here are some macros I use to convert a “b32x2 to a b64” and “a b64 to a b32x2”. The second pair of macros do the same but swap the order of the words:

These are useful when working with SHFL.

Norbert is right, my slides contain typos. It should be mov.b64 because we unpack a double. Also, it must be “=d”(x).

An alternative without inline PTX, could be the following code fragment:

__device__ __forceinline__ double shfl( double r, int lane )
  int hi = __shfl( __double2hiint(r), lane );
  int lo = __shfl( __double2loint(r), lane );
  return __hiloint2double( hi, lo );

The PTX will contain a couple of extra MOVs which will be optimized away in SASS.

Ah, that’s much nicer than macros! Will cut-and-paste. :)

Are there similar intrinsics to split a longlong or ulonglong into 32 bit low and high words?


I don’t see anything in the Math API PDF or in include/math_functions*.h.

Would be nice to have in order to avoid PTX hackery.

currently hacking it like this where C[0] happens to be an ulonglong2 vector

(uint32_t)__double2loint(__longlong_as_double((long long int)(C[0].x)))

not nice. Trying to access the first uint32 word in memory. Not sure if I have the endianness right.

Ouch, my eyes! :) Perhaps dropping down to PTX makes sense in this case.

Something like this should work for ulonglong1… and you could just explicitly unpack the ulonglong2 into its x and y components?

__device__ __forceinline__
scalarToVector(const unsigned long long s)
  uint2 v;

  asm("mov.b64 {%0,%1}, %2;" : "=r"(v.x), "=r"(v.y) : "l"(s));
  return v;

__device__ __forceinline__
unsigned long long
vectorToScalar(const uint2 v)
  unsigned long long s;

  asm("mov.b64 %0, {%1,%2};" : "=l"(s) : "r"(v.x), "r"(v.y));
  return s;

So I think your example, for unsigned values, would be:


Trivia: I noticed earlier today that the 32-byte ulonglong4 is implemented but not documented in the programming guide. It is not a structure you want to be loading and storing to device memory since it won’t be automatically coalesced (at least not yet). Although it might be useful as a handle to 4 64-bit registers.

Please feel free to file an enhancement request [RFE] via the bug reporting form linked from the registered developer website for device functions that split and compose long longs. The few places where I have needed to take apart a long long in this fashion I have used mov.b64 via inline PTX.

Side remark: the proper place for such intrinsics would be device_functions.h, not one of the math function header files which are intended for math functions.

As for ulonglong4, that works just fine in my experience and can be useful as a container for 256-bit operands, for example when performing wide multiplies as I demonstrated in this thread:


I will check with the relevant team on the documentation issue; presumably that is just an inadvertent omission.