Is there an efficient pattern for multiple independent reductions in a single warp?

Let us say each thread in a warp has an array of N floats (32*N floats in total). I need to compute the sum, across the warp, of all values that share the same index, leaving N floats after the reduction; none of the N outputs has a data dependency on any other.

In the special (typical) case of N=1, I would just __shfl_down() with strides of 1, 2, 4, 8, 16, each shuffle followed by an addition. This means 5 shuffles and 5 additions. (However, on the last shuffle only 1 thread performs useful work, of course.)
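
For concreteness, that baseline is something like the following sketch (the name is just illustrative, and I am using the pre-CUDA-9 __shfl_down; on CUDA 9+ it would be __shfl_down_sync with a member mask):

// The N == 1 baseline: 5 shuffles and 5 additions.
__device__ float warp_reduce_sum(float v)
{
    v += __shfl_down(v, 1);
    v += __shfl_down(v, 2);
    v += __shfl_down(v, 4);
    v += __shfl_down(v, 8);
    v += __shfl_down(v, 16);
    return v;  // only lane 0 holds the complete sum
}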

For the N>1 case, is it possible to perform N full-warp reductions in less than 5N shuffles and 5N additions? Let us assume for now that I am flexible with where my data ends up.

Great question. I puzzled over something similar a long time ago and am trying to refresh my memory.

I think the answer is “yes” and you can do better than 5N+5N operations.

The quick sketch for N=2:

row0 += shfl_xor(row0,1)
row1 += shfl_xor(row1,1)
row01 = (lane_idx & 1) == 0 ? row0 : row1

→ Neighboring lanes from row0 are in even lanes and row1 neighbors are in odd lanes.

Continuing:

row01 += shfl_xor(row01,2)

→ Half of the lanes are now redundant and we can work on rows 2 and 3 after we partially reduce them… With clever shuffling you might be able to get them in the right position to just drop into the best “free” slots in the row01 warp when you build row0123.

( hand waving at this point because it feels like it’s going to work for N > 2 )

Here’s a diagram of what I’m talking about with N=2 and warps that are only 4 lanes wide:

_row 0_   _row 1_
0 1 2 3   4 5 6 7 
 X   X     X   X   shfl_xor(1)
0 1 2 3   4 5 6 7 
1 0 3 2   5 4 7 6  +
   \         /     
    \       /      "select" with ternary operator on even/odd                     
     0 5 2 7                            
     1 4 3 6                            
      \ X /        shfl_xor(2)          
       V V                              
       0 5                              
       1 4                              
       2 7                              
       3 6         +
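
In code, that N=2 scheme would be something like this (an untested sketch with an illustrative name, pre-CUDA-9 intrinsics): 6 shuffles and 6 additions instead of 10 and 10, with the row-0 total in every even lane and the row-1 total in every odd lane afterwards.

__device__ float warp_reduce_2(float row0, float row1)
{
    row0 += __shfl_xor(row0, 1);
    row1 += __shfl_xor(row1, 1);
    // Interleave: even lanes carry row0, odd lanes carry row1.
    float r = (threadIdx.x & 1) ? row1 : row0;
    r += __shfl_xor(r, 2);
    r += __shfl_xor(r, 4);
    r += __shfl_xor(r, 8);
    r += __shfl_xor(r, 16);
    return r;
}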

Whatever the case, I would draw this out on paper or diagram it if you have a favorite app.

The routines you would develop for N=2-32 would be:

float csp256_warp_reduce_2(float v0, float v1);
float csp256_warp_reduce_3(float v0, float v1, float v2);
...

You would have to document where the sums end up for each case—most likely they’ll be in the first N lanes of the warp (?).
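
Something like the following might even cover all the power-of-two cases at once (an untested sketch, and it only works if full unrolling keeps v[] in registers). It also answers the layout question: the sum of row r ends up in every lane with (lane & (N-1)) == r, so lanes 0..N-1 each hold one distinct result.

template <int N>  // N a power of two, 1 <= N <= 32
__device__ float csp256_warp_reduce(float (&v)[N])
{
    // Merge pairs of rows, doubling the shuffle stride each stage.
    #pragma unroll
    for (int s = 1; s < N; s <<= 1) {
        #pragma unroll
        for (int r = 0; r < N; r += 2 * s) {
            v[r]     += __shfl_xor(v[r], s);
            v[r + s] += __shfl_xor(v[r + s], s);
            if (threadIdx.x & s) v[r] = v[r + s];
        }
    }
    // Finish reducing the fully interleaved row across the warp.
    #pragma unroll
    for (int s = N; s < 32; s <<= 1)
        v[0] += __shfl_xor(v[0], s);
    return v[0];
}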

Thanks! After writing it down with two different color pens, I see how it works and can verify that it is more efficient (on paper). That is a very clever solution.

I might actually come back and post a detailed solution / implementation later.

PS: Your blog and numerous posts have been helpful to me in the past, so thank you for that too.

@allanmac: +1 for the use of ASCII art :-)

Thanks @csp256! A working implementation would be cool to see.

@njuffa: Emacs meta-x picture-mode :)

I had the same problem and changed the whole algorithm to produce 32 data lines before starting to sum them. So with N=32 the entire operation is just 32 additions, plus 33*128 bytes of shared memory for holding the data. I even decided to dedicate a separate warp to this process: 4-8 warps produce data into the first memory array while, at the same time, the 9th warp sums the data in the second array and writes the sums to memory.

With N=16, you need to perform 16 full-warp additions and then combine the two halves of the warp (SHFL+ADD).

With N=8, you are going to perform 8 ADDs followed by 2*(SHFL+ADD)

N=5 is probably best handled as an N=1 reduction plus an N=4 reduction.

However, I completely ignored the LD/ST operation count, which will essentially dominate the ALU ops in this approach: it is N stores followed by N loads.
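
To make that concrete, the N=32 variant might look something like this (an untested sketch, simplified to a single warp doing both the stores and the sums rather than the producer/consumer split described above; v[] stands for the thread's 32 values, and the names are illustrative):

// 32 rows of 33 floats = 33*128 bytes; the extra float of row pitch
// keeps both the row writes and the column reads free of bank conflicts.
__shared__ float buf[32][33];

const int lane = threadIdx.x & 31;  // lane index within the warp

// Each lane writes its 32 values into its own row.
#pragma unroll
for (int i = 0; i < 32; ++i)
    buf[lane][i] = v[i];
__syncthreads();

// Each lane then sums one column: 32 additions, no shuffles at all.
float sum = 0.0f;
#pragma unroll
for (int i = 0; i < 32; ++i)
    sum += buf[i][lane];
// Lane j now holds the total for index j.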

Nice!

Here is a first stab at it. The function distance325 loads the value into an element of ssd. This is for a specialized brute-force k-NN implementation, so I did not write it in the most general way. However, it should be clear what needs to happen to make it work under more arbitrary circumstances.

The result of the posted code, for each warp, is that the distance squared between query vector i and the training vector is held in ssd[0] by all threads such that (threadIdx.x % 4 == i).

I haven’t profiled this section of the code yet, but it is the hottest part of my entire code base (including the CPU code). It is run by every warp in every block inside four loops, and I am register-limited.

register float ssd[3];

// Partial distances for queries 0 and 1.
distance325(&ssd[0], query[0], s_training[trainingOffset]);
distance325(&ssd[1], query[1], s_training[trainingOffset]);
// Pairwise reduce each row, then interleave: even lanes keep query 0,
// odd lanes keep query 1.
ssd[0] += __shfl_xor(ssd[0], 1);
ssd[1] += __shfl_xor(ssd[1], 1);
if (threadIdx.x & 1) {
    ssd[0] = ssd[1];
}

// Queries 2 and 3 reuse ssd[1] and ssd[2], interleaving into ssd[1].
distance325(&ssd[1], query[2], s_training[trainingOffset]);
distance325(&ssd[2], query[3], s_training[trainingOffset]);
ssd[1] += __shfl_xor(ssd[1], 1);
ssd[2] += __shfl_xor(ssd[2], 1);
if (threadIdx.x & 1) {
    ssd[1] = ssd[2];
}

// Merge the two interleaved rows on bit 1 of the lane index.
ssd[0] += __shfl_xor(ssd[0], 2);
ssd[1] += __shfl_xor(ssd[1], 2);
if (threadIdx.x & 2) {
    ssd[0] = ssd[1];
}

// Finish the reduction across the rest of the warp.
ssd[0] += __shfl_xor(ssd[0], 4);
ssd[0] += __shfl_xor(ssd[0], 8);
ssd[0] += __shfl_xor(ssd[0], 16);

EDIT: In case anyone finds this on Google, the code below is what ended up being the fastest for me (computing a large number of Hamming distances between 2048-bit vectors). Note that each bitvector is stored as 2 Int32s in each of the 32 threads of the warp. Because the Hamming distance is bounded above (at most 2048 here, which fits comfortably in a halfword), we can safely pack two values into each dist variable.

// The compiler throws a hissy fit if you try to make dist an array, and tosses everything into local memory.
register int dist0, dist1, dist2, dist3, dist4, dist5, dist6, dist7;
// Also, the compiler does not like this being in a (fully unrolled) loop... drama queen.
// Distances for queries 0-7 go into the low halfwords...
dist0 = __popc(query[0][0] ^ train[0]) + __popc(query[0][1] ^ train[1]);
dist1 = __popc(query[1][0] ^ train[0]) + __popc(query[1][1] ^ train[1]);
dist2 = __popc(query[2][0] ^ train[0]) + __popc(query[2][1] ^ train[1]);
dist3 = __popc(query[3][0] ^ train[0]) + __popc(query[3][1] ^ train[1]);
dist4 = __popc(query[4][0] ^ train[0]) + __popc(query[4][1] ^ train[1]);
dist5 = __popc(query[5][0] ^ train[0]) + __popc(query[5][1] ^ train[1]);
dist6 = __popc(query[6][0] ^ train[0]) + __popc(query[6][1] ^ train[1]);
dist7 = __popc(query[7][0] ^ train[0]) + __popc(query[7][1] ^ train[1]);
// ...and distances for queries 8-15 are packed into the high halfwords.
dist0 |= (__popc(query[ 8][0] ^ train[0]) + __popc(query[ 8][1] ^ train[1])) << 16;
dist1 |= (__popc(query[ 9][0] ^ train[0]) + __popc(query[ 9][1] ^ train[1])) << 16;
dist2 |= (__popc(query[10][0] ^ train[0]) + __popc(query[10][1] ^ train[1])) << 16;
dist3 |= (__popc(query[11][0] ^ train[0]) + __popc(query[11][1] ^ train[1])) << 16;
dist4 |= (__popc(query[12][0] ^ train[0]) + __popc(query[12][1] ^ train[1])) << 16;
dist5 |= (__popc(query[13][0] ^ train[0]) + __popc(query[13][1] ^ train[1])) << 16;
dist6 |= (__popc(query[14][0] ^ train[0]) + __popc(query[14][1] ^ train[1])) << 16;
dist7 |= (__popc(query[15][0] ^ train[0]) + __popc(query[15][1] ^ train[1])) << 16;

// Stage 1: pairwise reduction, interleaving rows on bit 0 of the lane index.
dist0 += __shfl_xor(dist0, 1);
dist1 += __shfl_xor(dist1, 1);
if (threadIdx.x & 1) dist0 = dist1;
dist2 += __shfl_xor(dist2, 1);
dist3 += __shfl_xor(dist3, 1);
if (threadIdx.x & 1) dist2 = dist3;
dist4 += __shfl_xor(dist4, 1);
dist5 += __shfl_xor(dist5, 1);
if (threadIdx.x & 1) dist4 = dist5;
dist6 += __shfl_xor(dist6, 1);
dist7 += __shfl_xor(dist7, 1);
if (threadIdx.x & 1) dist6 = dist7;
// Stage 2: interleave on bit 1.
dist0 += __shfl_xor(dist0, 2);
dist2 += __shfl_xor(dist2, 2);
if (threadIdx.x & 2) dist0 = dist2;
dist4 += __shfl_xor(dist4, 2);
dist6 += __shfl_xor(dist6, 2);
if (threadIdx.x & 2) dist4 = dist6;
// Stage 3: interleave on bit 2.
dist0 += __shfl_xor(dist0, 4);
dist4 += __shfl_xor(dist4, 4);
if (threadIdx.x & 4) dist0 = dist4;
// Finish the reduction across the remaining lanes.
dist0 += __shfl_xor(dist0, 8);
dist0 += __shfl_xor(dist0, 16);
// Unpack: lanes 0-7 keep the low halfword (the 2047 mask assumes distances
// stay below 2048), lanes 8-15 take the high halfword.
if ((threadIdx.x & 31) < 8) dist0 &= 2047;
else dist0 >>= 16;

Threads 0 through 15 now have the Hamming distance between the bitvectors query[laneID] and train stored in dist0. This is several times faster than any alternative I am aware of.