Odd/Even Shfl Issues

Is something like the following snippet valid (where the entire warp is coalesced)?

(array is a ulong4[])

ulong4 output;
coalesced_group active = coalesced_threads();

if (active.thread_rank() % 2)
	output = array[id];
	output.x = active.shfl_down(output.x, 1);
	output.y = active.shfl_down(output.y, 1);
	output.z = active.shfl_down(output.z, 1);
	output.w = active.shfl_down(output.w, 1);

I’d like each odd thread in the warp to load in some data from “array” and then have each even thread copy that data. Each odd/even pair of threads in the warp should have the same data.

Basically, each even thread should copy its higher next-door neighbour’s data.

Currently, my results are unexpected, but I don’t see what the problem is with this.


if (active.thread_rank() % 2)
	output = array[id];
if (active.thread_rank() % 2 == 0)


  1. ensures that the output variable is filled prior to use (C++ rules allow the compiler to reorder the then and else blocks)
  2. ensures that the output variable’s contents are propagated among all threads

I’d like to utilise shfl though; imagine that “array” is in device memory. Shfl will cut down the number of loads in this kernel by a large amount.

Lane 0,1 need the same data, lanes 2,3 need the same data etc.

Oh, sorry, I edited my message. That said, I’m pretty sure it will make your program slower compared to the straightforward solution. Check the CUDA manual’s description of coalesced access: if two lanes of the same warp read the same memory address, the data is just replicated in the load engine. Traffic will be the same, and a single output = array[id] assignment with the right id will be faster than the same assignment prepended by an if, all the more so than either your full code or mine.

I think I’m just repeating BulatZiganshin here.

I think a straight-up load will be quicker:

output = array[id | 1];

If you want to do it with shuffle, this should work:

ulong4 output;
coalesced_group active = coalesced_threads();

if (active.thread_rank() % 2)
	output = array[id];
output.x = active.shfl_down(output.x, ((active.thread_rank()+1)&1));
output.y = active.shfl_down(output.y, ((active.thread_rank()+1)&1));
output.z = active.shfl_down(output.z, ((active.thread_rank()+1)&1));
output.w = active.shfl_down(output.w, ((active.thread_rank()+1)&1));

Note that for a shuffle transfer to be valid for a particular lane, both source and destination lanes must be participating in the shuffle operation. Since even and odd lanes are involved here, all 32 threads in the warp must participate.

Note that the load operation here differs between Windows and Linux. On Windows, ulong4 is a vector type of four 32-bit quantities (which can be pulled in a single vector load per thread); on Linux it is a vector type of four 64-bit quantities (which cannot be done in a single vector load per thread). This will have an effect on the load operation.

Ok, thanks for your help both!

As you expected, it was slower doing the shfl. I’d missed that the data gets replicated across lanes - thanks for the tip!