In-place register-based matrix transpose with shuffle not working

I’m trying to implement an in-place matrix transpose. The code in the thread Transpose 2D matrix with warp shuffle and in-place array can probably solve my problem, but I was trying to implement something different, and I cannot figure out what is wrong with this kernel.

The idea is that at each iteration we merge two vectors, say from threads 0 and N/2, and store the result in threads 0 and 1, and so on. This process is carried out log(N) times.

For some reason, after the first iteration the vector in the odd threads ends up the same as in the even threads.

Can anyone see a bug in this code (mytranspose), or is there some limitation of __shfl_sync or of thread-private arrays that I’m not taking into account?

#include <stdio.h>

// Each thread prints its private vector z on its own row (assumes the block has N threads).
template<class T, int N>
__device__ void printvector(T z[N]) {
    for (int th = 0; th < N; th++) {
        if (threadIdx.x == th) {
            for (int n = 0; n < N; n++) {
                printf("% 5d", z[n]);
            }
            printf("\n");
        }
    }
    if (threadIdx.x == 0) printf("\n");
}

template<int N>
__global__ void bobtranspose() {

    int z[N];

    for (int n = 0; n < N; n++) {
        z[n] = n + threadIdx.x * 100;
    }

    printvector<int, N>(z);

    unsigned mask = __activemask();

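    // Butterfly (XOR) exchange: on step i, lane tid publishes its z[tid ^ i]
    // and reads back lane (tid ^ i)'s z[tid], storing it in z[tid ^ i].  After
    // all nonzero i, z[] in lane tid holds column tid of the original matrix.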
    for (int i = 1; i < N; i++) {
        int idx = threadIdx.x ^ i;
        z[idx] = __shfl_sync(mask, z[idx], idx);
    }

    printvector<int, N>(z);

}

template<int N, int It>
__global__ void mytranspose() {

    int z[N];

    for (int n = 0; n < N; n++) {
        z[n] = n + threadIdx.x * 100;
    }

    printvector<int, N>(z);

    unsigned mask = __activemask();
    int w[N];

    int C2 = N >> 1;
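    // Intended merge step (per the description above): lanes 2k and 2k+1 both
    // gather from lanes k and k+C2 and interleave their elements, lane 2k
    // taking the first halves and lane 2k+1 the second halves.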
    for (int iteration = 0; iteration < It; iteration++) {
        int inc = (threadIdx.x & 0x1) ? C2 : 0;
        int t1 = threadIdx.x / 2;
        int t2 = t1 + C2;

        for (int v = 0; v < C2; v++) {
            w[v * 2 + 0] = __shfl_sync(mask, z[v+inc], t1);
            w[v * 2 + 1] = __shfl_sync(mask, z[v+inc], t2);
        }

        for (int v = 0; v < N; v++) {
            z[v] = w[v];
        }

        printvector<int, N>(z);
    }
}

#define VEC_LEN 8
int main(int argc, char **argv) {

    printf("bobtranspose\n");
    bobtranspose<VEC_LEN><<<1, VEC_LEN>>>();
    cudaDeviceSynchronize();

    printf("\n\nmytranspose\n");
    mytranspose<VEC_LEN, 1><<<1, VEC_LEN>>>();
    cudaDeviceSynchronize();
}

Program output:

bobtranspose
    0    1    2    3    4    5    6    7
  100  101  102  103  104  105  106  107
  200  201  202  203  204  205  206  207
  300  301  302  303  304  305  306  307
  400  401  402  403  404  405  406  407
  500  501  502  503  504  505  506  507
  600  601  602  603  604  605  606  607
  700  701  702  703  704  705  706  707

    0  100  200  300  400  500  600  700
    1  101  201  301  401  501  601  701
    2  102  202  302  402  502  602  702
    3  103  203  303  403  503  603  703
    4  104  204  304  404  504  604  704
    5  105  205  305  405  505  605  705
    6  106  206  306  406  506  606  706
    7  107  207  307  407  507  607  707



mytranspose
    0    1    2    3    4    5    6    7
  100  101  102  103  104  105  106  107
  200  201  202  203  204  205  206  207
  300  301  302  303  304  305  306  307
  400  401  402  403  404  405  406  407
  500  501  502  503  504  505  506  507
  600  601  602  603  604  605  606  607
  700  701  702  703  704  705  706  707

    0  400    1  401    2  402    3  403
    0  400    1  401    2  402    3  403
  100  500  100  500  100  500  100  500
  100  500  100  500  100  500  100  500
  200  600  201  601  202  602  203  603
  200  600  201  601  202  602  203  603
  300  700  300  700  300  700  300  700
  300  700  300  700  300  700  300  700

A given thread will draw data from at most 2 other threads:

    int t1 = threadIdx.x / 2;
    int t2 = t1 + C2;

    for (int v = 0; v < C2; v++) {
        w[v * 2 + 0] = __shfl_sync(mask, z[v+inc], t1);
        w[v * 2 + 1] = __shfl_sync(mask, z[v+inc], t2);
    }

How could that possibly work? A given thread needs to draw data from 7 other threads.

I’m not sure what you mean; this is what the unrolled loop should look like in each thread for a vector length of 4:

// thread 0
w[0] = __shfl_sync(mask, z[0], 0);
w[1] = __shfl_sync(mask, z[0], 2);
w[2] = __shfl_sync(mask, z[1], 0);
w[3] = __shfl_sync(mask, z[1], 2);

// thread 1
w[0] = __shfl_sync(mask, z[2], 0);
w[1] = __shfl_sync(mask, z[2], 2);
w[2] = __shfl_sync(mask, z[3], 0);
w[3] = __shfl_sync(mask, z[3], 2);

// thread 2
w[0] = __shfl_sync(mask, z[0], 1);
w[1] = __shfl_sync(mask, z[0], 3);
w[2] = __shfl_sync(mask, z[1], 1);
w[3] = __shfl_sync(mask, z[1], 3);

// thread 3
w[0] = __shfl_sync(mask, z[2], 1);
w[1] = __shfl_sync(mask, z[2], 3);
w[2] = __shfl_sync(mask, z[3], 1);
w[3] = __shfl_sync(mask, z[3], 3);

Thread 0 for a vector length of 4 starts out with (in z):

0    1    2    3

and should end up with (in w):

0    100    200    300

The value of 0, of course, comes from thread 0. The value of 100 comes from thread 1, the value of 200 comes from thread 2, and the value of 300 comes from thread 3. Therefore, if the transpose is to have any chance of success, thread 0 had better retrieve a value from each of threads 1, 2, and 3 (at least; and note that a value cannot cycle from thread A to B to C in your implementation, because the source is always z and the destination is always w for each shuffle op).

Therefore, looking at your unrolled code, I’ve added comments (below, for thread 0) to explain what I mean. The last argument to the __shfl_sync op is the source lane. Since you are sourcing values from lanes (threads) 0 and 2 only, there is no way that thread 0 could retrieve a value from lane (thread) 1 or 3, but we’ve already established that this is necessary for the transpose to work.
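
// thread 0
w[0] = __shfl_sync(mask, z[0], 0);   // source lane 0
w[1] = __shfl_sync(mask, z[0], 2);   // source lane 2
w[2] = __shfl_sync(mask, z[1], 0);   // source lane 0
w[3] = __shfl_sync(mask, z[1], 2);   // source lane 2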

Your source lane pattern could not possibly be correct.

This is a single iteration; you need log(N) iterations to get the final transposed matrix. Sorry for the confusion.

The reason I limited the discussion to the first iteration is that you can already see the problem there. Thread 0 contains the correct data (for the first iteration), but thread 1 contains a copy of it:

   0  400    1  401    2  402    3  403
   0  400    1  401    2  402    3  403

The correct values should be:

   0  400    1  401    2  402    3  403
   4  404    5  405    6  406    7  407

OK.

The reason your first two threads are retrieving the same values is that those are the values being published from the source lane. A common misconception is that the destination lane somehow gets to “select” a value from an array published in the source lane. It does not. On any given __shfl_sync op, the destination lane picks up whatever value is published by the source lane for that instruction.
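
As a minimal standalone sketch of that point (not part of your code; the shfl_demo kernel and its values are just for illustration, and it can be dropped into the same .cu file): every lane publishes one value per __shfl_sync instruction, and every lane receives whatever its chosen source lane published, regardless of what it published itself.

__global__ void shfl_demo() {
    unsigned mask = __activemask();
    int published = threadIdx.x * 10;                 // the value this lane publishes
    int received  = __shfl_sync(mask, published, 3);  // every lane reads from lane 3
    // Every lane prints "received 30", because lane 3 published 30; the value a
    // lane passed as its own 2nd argument has no effect on what it reads back.
    printf("lane %d published %3d received %3d\n", threadIdx.x, published, received);
}
// launch with e.g. shfl_demo<<<1, 8>>>(); followed by cudaDeviceSynchronize();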

Let’s go through your code with this in mind. Let’s choose lane 1 as an example, because you seem to think it should differ from the lane 0 results.

for (int v = 0; v < C2; v++) {
    w[v * 2 + 0] = __shfl_sync(mask, z[v+inc], t1);  // line 1
    w[v * 2 + 1] = __shfl_sync(mask, z[v+inc], t2);  // line 2
}

For thread 0 and for thread 1, let’s work with a vector length of 4; therefore C2 = 2, t1 = 0, and t2 = 2. For thread 0, inc = 0, and for thread 1, inc = 2.

On the first pass of the loop, at line 1, thread 0 picks up whatever value is published by thread 0, and thread 1 picks up that same value. Thread 0 is publishing its own z[0] at that point (the 2nd argument), and both thread 0 and thread 1 have 0 as the source lane, so they both pick up thread 0’s z[0]. If you work through the next line and the next loop iteration, you will see that thread 0 and thread 1 have the same source lane pattern (whether 0 or 2) at every step, and therefore each picks up whatever value that source lane is publishing. They therefore pick up the same values, and the ending results are the same (between threads 0 and 1).
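
Spelling that out for a vector length of 4, using your initialization z[n] = n + threadIdx.x * 100:

// v = 0, line 1: each lane publishes its z[0+inc]; threads 0 and 1 both read lane t1 = 0
//   lane 0 publishes z[0] = 0 (lane 1 publishes z[2] = 102, but that only goes to
//   whichever lanes chose lane 1 as the source, i.e. lanes 2 and 3)
//   thread 0: w[0] = 0       thread 1: w[0] = 0
// v = 0, line 2: threads 0 and 1 both read lane t2 = 2, which publishes z[0] = 200
//   thread 0: w[1] = 200     thread 1: w[1] = 200
// v = 1, line 1: both read lane 0, which publishes z[1] = 1
//   thread 0: w[2] = 1       thread 1: w[2] = 1
// v = 1, line 2: both read lane 2, which publishes z[1] = 201
//   thread 0: w[3] = 201     thread 1: w[3] = 201
// Both threads finish with w = { 0, 200, 1, 201 }.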

Looking at thread 1’s first unrolled shuffle (w[0] = __shfl_sync(mask, z[2], 0);), it’s easy to get confused and think that thread 1 is “asking” for z[2] from thread 0. That is not what is happening. Thread 1 is publishing its z[2]. At the same time, thread 1 is retrieving whatever value is being published by thread 0, and that value happens to be thread 0’s z[0] (the same as what thread 0 gets at that point).

Here’s another way to think about it:
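
Take the same inner loop, but read each __shfl_sync as one warp-wide instruction:

for (int v = 0; v < C2; v++) {
    // One instruction executed by every lane: each lane offers its own z[v+inc]
    // to the warp, and reads back whatever was offered by the lane it names as
    // the source (t1).  Threads 0 and 1 name the same source lane, so they read
    // back the same value; what each of them offered has no bearing on what
    // they receive.
    w[v * 2 + 0] = __shfl_sync(mask, z[v+inc], t1);
    // Same story here, with source lane t2.
    w[v * 2 + 1] = __shfl_sync(mask, z[v+inc], t2);
}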

Hopefully with that commentary you can see that thread 0 and thread 1 will end up with the same values in the w array.

Oh, I see now, thanks. I guess I’ll just stick to the xor way for now!