Odd/Even Shfl Issues

Is something like the following snippet valid (where the entire warp is coalesced)?

(array is a ulong4[])

ulong4 output;
coalesced_group active = coalesced_threads();

if (active.thread_rank() % 2)
	output = array[id];
	output.x = active.shfl_down(output.x, 1);
	output.y = active.shfl_down(output.y, 1);
	output.z = active.shfl_down(output.z, 1);
	output.w = active.shfl_down(output.w, 1);

I’d like each odd thread in the warp to load in some data from “array” and then have each even thread copy that data. Each odd/even pair of threads in the warp should have the same data.

Basically, each even thread should copy its higher next-door neighbour’s data.

Currently, my results are unexpected, but I don’t see what the problem is with this.


if (active.thread_rank() % 2)
	output = array[id];
if (active.thread_rank() % 2 == 0)


  1. ensures that the output variable is filled prior to use (C++ rules allow the compiler to reorder the then and else blocks)
  2. ensures that the output variable’s contents are propagated among all threads

I’d like to utilise shfl though; imagine that “array” is in device memory. Shfl will cut down the number of loads in this kernel by a large amount.

Lane 0,1 need the same data, lanes 2,3 need the same data etc.

Oh, sorry, I edited my message. That said, I’m pretty sure it will make your program slower compared to the straightforward solution. Check the CUDA manual’s description of coalesced access: if two lanes of the same warp read the same memory address, the data is just replicated in the load engine. Traffic will be the same, and a single output = array[id] assignment with the right id will be faster than the same assignment prepended by an if, all the more so than either your full code or mine.

I think I’m just repeating BulatZiganshin here.

I think a straight-up load will be quicker:

output = array[id | 1];

If you want to do it with shuffle, this should work:

ulong4 output;
coalesced_group active = coalesced_threads();

if (active.thread_rank() % 2)
	output = array[id];
output.x = active.shfl_down(output.x, ((active.thread_rank()+1)&1));
output.y = active.shfl_down(output.y, ((active.thread_rank()+1)&1));
output.z = active.shfl_down(output.z, ((active.thread_rank()+1)&1));
output.w = active.shfl_down(output.w, ((active.thread_rank()+1)&1));

Note that for a shuffle transfer to be valid for a particular lane, both source and destination lanes must be participating in the shuffle operation. Since even and odd lanes are involved here, all 32 threads in the warp must participate.

Note that the load operation here differs between Windows and Linux. On Windows, ulong4 is a vector type of four 32-bit quantities (which can be pulled in a single vector load per thread); on Linux it is a vector type of four 64-bit quantities (which cannot be done in a single vector load per thread). This will have an effect on the load operation.

Ok, thanks for your help both!

As you expected, it was slower doing the shfl. I’d missed that the data gets replicated across lanes - thanks for the tip!