Is it possible to read array values between threads with __shfl()?

Hello,

I’m trying to implement a scheme whereby each thread keeps several small arrays of values, fills them up, and then reads the values from other threads. The picture is roughly like this:

int thr_x_idx, thr_y_idx, thr_z_idx;
float xc[4], yc[4], zc[4];

// tgx holds the thread position within the warp
int tgx = threadIdx.x & 31;

// Assign each thread the parts of xc, yc, and zc to read
thr_x_idx = (tgx & 3);
thr_y_idx = (tgx >> 2) & 3;
thr_z_idx = (tgx >> 4) & 3;

< fill up xc, yc, zc >
int xbase = < some calculation unique to this thread >;
int ybase = < some calculation unique to this thread >;
int zbase = < some calculation unique to this thread >;
int xyzbase = xbase + (constant for y)*ybase + (constant for x and y)*zbase;
int xyzmove = thr_x_idx + (constant for y)*thr_y_idx + (constant for x and y)*thr_z_idx;

for (int i = 0; i < 32; i++) {
  int xyzpt = __shfl(xyzbase) + xyzmove;
  float xyzc = __shfl(xc[thr_x_idx]) * __shfl(yc[thr_y_idx]) * __shfl(zc[thr_z_idx]);
  < atomicAdd xyzc to shared memory >
}

The point of this is that I can do most of the labor outside the for (i = 0; i < 32; …) loop, and within the loop have thr_x_idx, thr_y_idx, and thr_z_idx set up so that each thread of the warp pulls on a different shared memory bank. Within a warp, no two atomicAdds will conflict and there will be no shared memory bank conflicts, all for the price of a couple more __shfl() operations than I would need if I skipped thr_{x,y,z}_idx and had each thread project its own xc / yc / zc products to shared memory (in the latter case there would be plenty of conflicts, due to the structure of the problem and what it does to xyzbase).
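Just to illustrate why I expect the banks to work out, here is a quick host-side check with purely hypothetical strides (the real constants for y and for x-and-y depend on the problem; 4 and 16 are stand-ins for the sake of the example). With those strides, the 32 per-thread offsets land on 32 distinct banks:

#include <cstdio>

// Illustration only: with made-up strides of 4 (y) and 16 (z), the offset
// thr_x_idx + 4*thr_y_idx + 16*thr_z_idx falls on a different 32-bit word,
// and therefore a different shared memory bank, for each of the 32 lanes.
int main()
{
  bool bank_used[32] = {false};
  for (int tgx = 0; tgx < 32; tgx++) {
    int x = tgx & 3;          // thr_x_idx
    int y = (tgx >> 2) & 3;   // thr_y_idx
    int z = (tgx >> 4) & 3;   // thr_z_idx
    int offset = x + 4*y + 16*z;
    bank_used[offset & 31] = true;   // 4-byte words map to banks modulo 32
  }
  int n = 0;
  for (int b = 0; b < 32; b++) {
    n += bank_used[b];
  }
  printf("distinct banks: %d\n", n);   // prints 32
  return 0;
}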

However, I’m getting some bizarre results: xyzc comes out the same across all threads, even though it should be different. Is there a problem with using __shfl() to read the values of small arrays stored by each thread? Could it be that the arrays are somehow not actually stored in registers, and the compiler is spilling them out to local/global memory?

Update: it looks like the problem may be that I am indexing the arrays with a thread-dependent variable inside the shuffle. That is, __shfl(xc[0], i) works, but __shfl(xc[thr_x_idx], i) returns the same value for all threads, regardless of the value of thr_x_idx on the reading thread. It appears that the value of thr_x_idx on thread i is the one used to select the element of xc that is then broadcast to all threads. I need to find a way to get it to use the value of thr_x_idx on the thread reading the result, but that may not be in the cards…
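For what it’s worth, a stripped-down kernel along these lines should reproduce what I’m seeing (made-up fill values, and written with the old __shfl(); on newer toolkits the call becomes __shfl_sync(0xffffffff, …)). Every lane prints 501.0, i.e. lane 5’s own thr_x_idx picked the element, not the reader’s:

#include <cstdio>

__global__ void shflIndexTest()
{
  int tgx = threadIdx.x & 31;
  int thr_x_idx = tgx & 3;

  // Each lane fills xc with values that encode both the lane and the element.
  float xc[4];
  for (int k = 0; k < 4; k++) {
    xc[k] = 100.0f * tgx + k;
  }

  // Every lane asks lane 5; the element index comes from lane 5's thr_x_idx (1),
  // so every lane gets 501.0 rather than 500.0 + its own thr_x_idx.
  float got = __shfl(xc[thr_x_idx], 5);
  printf("lane %2d (thr_x_idx %d) got %6.1f\n", tgx, thr_x_idx, got);
}

int main()
{
  shflIndexTest<<<1, 32>>>();
  cudaDeviceSynchronize();
  return 0;
}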


This is exactly the behavior I would expect. The only value of thr_x_idx that matters is the one on the selected (source) lane, and according to my testing it certainly does affect the result. According to the documentation:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-shuffle-functions

“__shfl() Direct copy from indexed lane”

In this case, the indexed lane is given by i.

i is the same for all threads (as best I can tell from your posted code).

Therefore all threads will return the same value for this particular shuffle operation.

xc is a thread-local array.

If you want thread x to be able to get xc[y] from thread z, then thread z had better present xc[y] to the shuffle op (and thread x had also better present the appropriate index to select thread z). Thread x cannot choose how to index into the xc array that is local to thread z; thread z makes that choice based on what it presents to the shuffle.
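If each reading thread needs a different element, the owning lane has to publish every element it might be asked for, and the reader then selects locally with its own index. A sketch, reusing your variable names:

// Lane i publishes all four elements of its xc[]; each reading thread then
// keeps the element chosen by its OWN thr_x_idx, which stays in registers.
float x0 = __shfl(xc[0], i);
float x1 = __shfl(xc[1], i);
float x2 = __shfl(xc[2], i);
float x3 = __shfl(xc[3], i);
float myx = (thr_x_idx == 0) ? x0 :
            (thr_x_idx == 1) ? x1 :
            (thr_x_idx == 2) ? x2 : x3;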

I must confess I don’t really know what you’re after, so perhaps this clarifies nothing. Your original posted code doesn’t present any actual testable case that I can figure out, and in fact constructions like:

__shfl(xc[thr_x_idx])

are not valid CUDA code (the source-lane argument is missing).

Thanks, and yes, the posted code was something I abstracted from my code, written off the cuff, so there are typos, as you pointed out. It appears that I will have to perform thirteen __shfl() operations to have every thread import the shared array index base and then all twelve coefficients, rather than the five I had planned (each thread importing the shared array index base, three coefficients, and then one more coefficient), but hopefully this won’t hurt performance too much.
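So the inner loop becomes roughly the following sketch (shmem is just a stand-in name for the shared memory accumulation buffer, and the ternary selection is one way to keep the pick in registers):

for (int i = 0; i < 32; i++) {
  // Lane i broadcasts its index base plus all twelve coefficients.
  int xyzpt = __shfl(xyzbase, i) + xyzmove;
  float x0 = __shfl(xc[0], i), x1 = __shfl(xc[1], i);
  float x2 = __shfl(xc[2], i), x3 = __shfl(xc[3], i);
  float y0 = __shfl(yc[0], i), y1 = __shfl(yc[1], i);
  float y2 = __shfl(yc[2], i), y3 = __shfl(yc[3], i);
  float z0 = __shfl(zc[0], i), z1 = __shfl(zc[1], i);
  float z2 = __shfl(zc[2], i), z3 = __shfl(zc[3], i);

  // Each thread selects with its OWN indices.
  float xv = (thr_x_idx == 0) ? x0 : (thr_x_idx == 1) ? x1 :
             (thr_x_idx == 2) ? x2 : x3;
  float yv = (thr_y_idx == 0) ? y0 : (thr_y_idx == 1) ? y1 :
             (thr_y_idx == 2) ? y2 : y3;
  float zv = (thr_z_idx == 0) ? z0 : (thr_z_idx == 1) ? z1 :
             (thr_z_idx == 2) ? z2 : z3;

  atomicAdd(&shmem[xyzpt], xv * yv * zv);
}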

Can you perhaps give me an idea of the cost of __shfl() versus multiply/add arithmetic?

Cheers!

The documentation:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions__throughput-native-arithmetic-instructions

That’s very useful!