How to split 8-byte data into two 4-byte parts so that I can use the warp shuffle feature?

There is a new feature named warp shuffle on Kepler devices. I want to use this feature in place of shared memory. However, the guide says that it can only shuffle 4 bytes of data at a time; to shuffle 8 bytes, I have to divide the data into two separate parts. Now I want to shuffle a value of type double (8 bytes). How can I divide it into two parts?

Use the type-safe functions that “reinterpret” a double into its hi and lo binary components (__double2hiint() and __double2loint()), shuffle each half, and then recombine hi and lo back into a double with __hiloint2double().
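A minimal sketch of the hi/lo approach, since the snippet originally posted here did not survive. The helper name shflUpDouble and the kernel are hypothetical, and the sketch assumes CUDA 9 or later, where the shuffle intrinsics are spelled __shfl_up_sync() and take a lane mask (on the original Kepler-era toolkits the call was __shfl_up() without the mask). It also assumes all 32 lanes of the warp are active.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: move a double up by `delta` lanes within a warp
// by shuffling its two 32-bit halves separately.
__device__ double shflUpDouble(double v, unsigned int delta)
{
    const unsigned int fullMask = 0xffffffffu;  // assumption: all 32 lanes participate
    int hi = __double2hiint(v);                 // upper 32 bits of the double
    int lo = __double2loint(v);                 // lower 32 bits of the double
    hi = __shfl_up_sync(fullMask, hi, delta);
    lo = __shfl_up_sync(fullMask, lo, delta);
    return __hiloint2double(hi, lo);            // recombine into a double
}

// Each lane contributes 1.5 * lane and receives the value of the lane below it.
// Lane 0 has no lower neighbor, so it keeps its own value.
__global__ void shiftUpKernel(double *out)
{
    double v = 1.5 * threadIdx.x;
    out[threadIdx.x] = shflUpDouble(v, 1);
}
```

Launch it with a single warp, for example shiftUpKernel<<<1, 32>>>(d_out). Recent toolkits also document double overloads of the __shfl_*_sync() functions, so the manual split may not be needed there.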

Alternatively, you can drop down into PTX and do something similar, but there is probably no need.

Reinterpret method:

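double foo = ...;
int2 tmpForExch = *(int2 *)(&foo);  // view the 8-byte double as two 4-byte ints

// __shfl_up() returns the value read from the source lane, so store the result
tmpForExch.x = __shfl_up(tmpForExch.x, 1, 32);
tmpForExch.y = __shfl_up(tmpForExch.y, 1, 32);

foo = *(double *)(&tmpForExch);  // recombine the two halves into a double

foo += 1;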

and with a union:

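union doubleForExch {
  int2 asInt2;
  double val;
};

doubleForExch foo;

foo.val = ...;

// __shfl_up() returns the value read from the source lane, so store the result
foo.asInt2.x = __shfl_up(foo.asInt2.x, 1, 32);
foo.asInt2.y = __shfl_up(foo.asInt2.y, 1, 32);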

foo.val += 1;

Hi, I’m using the union now. Of the two options you posted here, which one do you think is faster?


Hi, it looks good. However, compared with the union, which one do you think would be faster?

The “cast to vector type” and union methods @eelsen described should produce the same code as the reinterpret approach. It’s all up to the compiler. :)

Nothing beats inspecting the PTX to make sure the compiler is doing what you expected.

Thanks, I will have a try.

I will point out that attempts to do binary re-interpretation via pointer casts invoke undefined C/C++ behavior and will not work as intended with various compilers, especially at higher optimization levels. For host code, a volatile union has worked reliably on every platform and tool chain I have worked with in the past twenty years, although strictly speaking one is on thin ice with regard to the standards there as well. I believe gcc promises not to break the union approach.

In the context of CUDA device code, I would recommend using either the built-in re-interpretation intrinsics, or C++ reinterpretation casts. E.g.:

float f = ...;
int i = __float_as_int (f);      // built-in re-interpretation intrinsic
i = reinterpret_cast<int&>(f);   // C++ reinterpretation cast
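For the double case from the original question, the corresponding 64-bit intrinsics are __double_as_longlong() and __longlong_as_double(). The sketch below is only an illustration: splitDouble, joinDouble, and checkSplitKernel are hypothetical names. It splits a double into two 32-bit halves with shifts and checks that the halves agree with __double2hiint() and __double2loint().

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: split a double into its upper and lower 32-bit halves.
__device__ void splitDouble(double d, int &hi, int &lo)
{
    unsigned long long bits = (unsigned long long)__double_as_longlong(d);
    hi = (int)(unsigned int)(bits >> 32);
    lo = (int)(unsigned int)(bits & 0xffffffffull);
}

// Hypothetical helper: rebuild the double from the two halves.
__device__ double joinDouble(int hi, int lo)
{
    unsigned long long bits =
        ((unsigned long long)(unsigned int)hi << 32) | (unsigned int)lo;
    return __longlong_as_double((long long)bits);
}

// Writes 1 to ok[i] if the manual split matches the hi/lo intrinsics and the
// round trip reproduces the original bit pattern, 0 otherwise.
__global__ void checkSplitKernel(const double *in, int *ok, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int hi, lo;
        splitDouble(in[i], hi, lo);
        bool sameHalves = (hi == __double2hiint(in[i])) &&
                          (lo == __double2loint(in[i]));
        bool roundTrip = (__double_as_longlong(joinDouble(hi, lo)) ==
                          __double_as_longlong(in[i]));
        ok[i] = (sameHalves && roundTrip) ? 1 : 0;
    }
}
```

In practice __double2hiint(), __double2loint(), and __hiloint2double() already do this split and join; the shift-based version is shown only to make the bit layout explicit.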

Now that njuffa mentions it, I do remember at some point having to add volatiles to get the casting approach to work. The test case I just wrote works fine, but the casting approach also gets compiled into one more instruction than the other two. So I have to agree and no longer recommend it, sorry.

Using the built-in reinterpretation intrinsics results in the same number of instructions as the union approach, BUT it uses two more registers. I don’t know enough about the standards to understand why the union approach isn’t kosher. It would be my preference: it results in cleaner code and uses fewer registers.

I can’t find chapter and verse on type punning via unions in my copy of the standard at the moment, but this thread on StackOverflow seems to contain pointers to all the relevant sections.

The difference in register use between the different re-interpretation methods you observe would appear to be an artifact (maybe of a particular source code usage or a particular toolchain), not something that holds in general? I used type re-interpretation intrinsics fairly extensively in the CUDA math library, and don’t recall any issues with register pressure caused by that. I have looked at the SASS for most of the functions in a fair amount of detail over the years.

If you have a simple example for this increase in register pressure from use of one of the type re-interpretation intrinsics, I would like to take a look at it. I guess this could happen if the code continues to use both the original and re-interpreted operands after invoking the intrinsic, which from that point forward are treated as separate objects. But if only the re-interpretation result is used after the intrinsic the original operand is “dead” and the re-interpreted operand can re-use the registers of the original operand. But that’s pure conjecture at this point.

Ok, so it seems like according to C99/C11 this union usage is totally fine. If not all the types are the same size then you may run into some trouble.

The register issue looks like it was because the compiler outsmarted my test case. I wrote the test case to modify only the x component of the int2, so in the union case the compiler was smart enough to load only those 4 bytes from global memory instead of the full 8 bytes, which is why it needed fewer registers.

Another approach that has worked, at least for me, is to keep the 64-bit type in its vector2 form and then before processing pack it into its scalar form, perform whatever 64-bit ops are required, then unpack it into its original vector format.

There is no need to do it this way unless you know you are going to need to process the vector type at least as frequently as the scalar 64-bit type.

The SASS wound up looking perfect to me.

The pack/unpack PTX mov instructions look something like this:

#define V2S_B64(v,s)   asm("mov.b64 %0, {%1,%2};" : "=l"(s)                    : "r"((v).x), "r"((v).y))
#define S2V_B64(s,v)   asm("mov.b64 {%0,%1}, %2;" : "=r"((v).x), "=r"((v).y) : "l"(s))

The PTX that gets generated looks unwieldy, but when you dump the SASS it seems to be an idiom that ptxas really knows how to eliminate.
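As a rough illustration of how the pack/unpack idiom combines with a warp shuffle, here is a sketch that moves an unsigned long long between lanes. The names shflDownU64 and shiftDownKernel are hypothetical, the inline asm is written out directly instead of going through macros, and the sketch assumes CUDA 9 or later (__shfl_down_sync()) with all 32 lanes active.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: shuffle a 64-bit integer down by `delta` lanes using
// mov.b64 to unpack it into two 32-bit registers and pack it again afterwards.
__device__ unsigned long long shflDownU64(unsigned long long v, unsigned int delta)
{
    const unsigned int fullMask = 0xffffffffu;  // assumption: all 32 lanes participate
    unsigned int lo, hi;

    // Unpack: lo receives bits 0..31, hi receives bits 32..63.
    asm("mov.b64 {%0,%1}, %2;" : "=r"(lo), "=r"(hi) : "l"(v));

    lo = __shfl_down_sync(fullMask, lo, delta);
    hi = __shfl_down_sync(fullMask, hi, delta);

    // Pack the two shuffled halves back into one 64-bit value.
    unsigned long long r;
    asm("mov.b64 %0, {%1,%2};" : "=l"(r) : "r"(lo), "r"(hi));
    return r;
}

// Each lane receives the input of the lane above it; lane 31 keeps its own value.
__global__ void shiftDownKernel(const unsigned long long *in, unsigned long long *out)
{
    out[threadIdx.x] = shflDownU64(in[threadIdx.x], 1);
}
```

Launch it with a single warp, for example shiftDownKernel<<<1, 32>>>(d_in, d_out).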

mov.b64 is exactly what __hiloint2double() maps to at PTX level. So for double data stored as textures, simply declare the texture as int2, then convert on the fly to double using this intrinsic. This type re-interpretation is free thanks to a shared register file. Using the PTX instruction is more versatile in that it can also be used for splitting an unsigned long long into two unsigned ints and similar use cases.
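The texture idiom can be sketched as follows. This version assumes the texture object API (cudaTextureObject_t with tex1Dfetch<int2>()) instead of the older texture references, and the names readDoublesViaTexture and makeInt2Texture are hypothetical. Because the targets are little-endian, the x component of the fetched int2 holds the low word and y the high word.

```cuda
#include <cuda_runtime.h>

// Fetch each element as int2 and reassemble the double on the fly.
__global__ void readDoublesViaTexture(cudaTextureObject_t tex, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int2 v = tex1Dfetch<int2>(tex, i);
        out[i] = __hiloint2double(v.y, v.x);  // y = high word, x = low word
    }
}

// Hypothetical host helper: expose a device buffer of n doubles as a
// one-dimensional int2 texture object over linear memory.
cudaError_t makeInt2Texture(const double *d_data, size_t n, cudaTextureObject_t *tex)
{
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = const_cast<double *>(d_data);
    resDesc.res.linear.desc = cudaCreateChannelDesc<int2>();
    resDesc.res.linear.sizeInBytes = n * sizeof(double);

    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;  // return the raw int2, no conversion

    return cudaCreateTextureObject(tex, &resDesc, &texDesc, nullptr);
}
```

After use, the texture object should be released with cudaDestroyTextureObject().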