In-place register-based matrix transpose with shuffle not working

I’m trying to implement an in-place matrix transpose. The code in the thread Transpose 2D matrix with warp shuffle and in-place array can probably solve my problem, but I was trying to implement something different, and I cannot figure out what is wrong with this kernel.

The idea is that at each iteration we merge two vectors, say from threads 0 and N/2, and store the result in threads 0 and 1, and so on. This process is carried out log(N) times.

For some reason, after the first iteration the vector in the odd threads ends up the same as in the even threads.

Can anyone see a bug in this code (mytranspose), or is there some limitation of __shfl_sync or of thread-private arrays that I’m not taking into account?

#include <stdio.h>

// Each thread prints its private vector z on its own row (assumes the block has N threads).
template<class T, int N>
__device__ void printvector(T z[N]) {
    for (int th = 0; th < N; th++) {
        if (threadIdx.x == th) {
            for (int n = 0; n < N; n++) {
                printf("% 5d", z[n]);
            }
            printf("\n");
        }
    }
    if (threadIdx.x == 0) printf("\n");
}

template<int N>
__global__ void bobtranspose() {

    int z[N];

    for (int n = 0; n < N; n++) {
        z[n] = n + threadIdx.x * 100;
    }

    printvector<int, N>(z);

    unsigned mask = __activemask();

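    // Butterfly (XOR) exchange: on step i, lane tid publishes its z[tid ^ i]
    // and reads back lane (tid ^ i)'s z[tid], storing it in z[tid ^ i].  After
    // all nonzero i, z[] in lane tid holds column tid of the original matrix.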
    for (int i = 1; i < N; i++) {
        int idx = threadIdx.x ^ i;
        z[idx] = __shfl_sync(mask, z[idx], idx);
    }

    printvector<int, N>(z);

}

template<int N, int It>
__global__ void mytranspose() {

    int z[N];

    for (int n = 0; n < N; n++) {
        z[n] = n + threadIdx.x * 100;
    }

    printvector<int, N>(z);

    unsigned mask = __activemask();
    int w[N];

    int C2 = N >> 1;
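    // Intended merge step (per the description above): lanes 2k and 2k+1 both
    // gather from lanes k and k+C2 and interleave their elements, lane 2k
    // taking the first halves and lane 2k+1 the second halves.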
    for (int iteration = 0; iteration < It; iteration++) {
        int inc = (threadIdx.x & 0x1) ? C2 : 0;
        int t1 = threadIdx.x / 2;
        int t2 = t1 + C2;

        for (int v = 0; v < C2; v++) {
            w[v * 2 + 0] = __shfl_sync(mask, z[v+inc], t1);
            w[v * 2 + 1] = __shfl_sync(mask, z[v+inc], t2);
        }

        for (int v = 0; v < N; v++) {
            z[v] = w[v];
        }

        printvector<int, N>(z);
    }
}

#define VEC_LEN 8
int main(int argc, char **argv) {

    printf("bobtranspose\n");
    bobtranspose<VEC_LEN><<<1, VEC_LEN>>>();
    cudaDeviceSynchronize();

    printf("\n\nmytranspose\n");
    mytranspose<VEC_LEN, 1><<<1, VEC_LEN>>>();
    cudaDeviceSynchronize();
}

Program output:

bobtranspose
    0    1    2    3    4    5    6    7
  100  101  102  103  104  105  106  107
  200  201  202  203  204  205  206  207
  300  301  302  303  304  305  306  307
  400  401  402  403  404  405  406  407
  500  501  502  503  504  505  506  507
  600  601  602  603  604  605  606  607
  700  701  702  703  704  705  706  707

    0  100  200  300  400  500  600  700
    1  101  201  301  401  501  601  701
    2  102  202  302  402  502  602  702
    3  103  203  303  403  503  603  703
    4  104  204  304  404  504  604  704
    5  105  205  305  405  505  605  705
    6  106  206  306  406  506  606  706
    7  107  207  307  407  507  607  707



mytranspose
    0    1    2    3    4    5    6    7
  100  101  102  103  104  105  106  107
  200  201  202  203  204  205  206  207
  300  301  302  303  304  305  306  307
  400  401  402  403  404  405  406  407
  500  501  502  503  504  505  506  507
  600  601  602  603  604  605  606  607
  700  701  702  703  704  705  706  707

    0  400    1  401    2  402    3  403
    0  400    1  401    2  402    3  403
  100  500  100  500  100  500  100  500
  100  500  100  500  100  500  100  500
  200  600  201  601  202  602  203  603
  200  600  201  601  202  602  203  603
  300  700  300  700  300  700  300  700
  300  700  300  700  300  700  300  700

A given thread will draw data from at most 2 other threads:

    int t1 = threadIdx.x / 2;
    int t2 = t1 + C2;

    for (int v = 0; v < C2; v++) {
        w[v * 2 + 0] = __shfl_sync(mask, z[v+inc], t1);
        w[v * 2 + 1] = __shfl_sync(mask, z[v+inc], t2);
    }

How could that possibly work? A given thread needs to draw data from 7 other threads.

I’m not sure what you mean; this is what the unrolled loop should look like in each thread for a vector length of 4:

// thread 0
w[0] = __shfl_sync(mask, z[0], 0);
w[1] = __shfl_sync(mask, z[0], 2);
w[2] = __shfl_sync(mask, z[1], 0);
w[3] = __shfl_sync(mask, z[1], 2);

// thread 1
w[0] = __shfl_sync(mask, z[2], 0);
w[1] = __shfl_sync(mask, z[2], 2);
w[2] = __shfl_sync(mask, z[3], 0);
w[3] = __shfl_sync(mask, z[3], 2);

// thread 2
w[0] = __shfl_sync(mask, z[0], 1);
w[1] = __shfl_sync(mask, z[0], 3);
w[2] = __shfl_sync(mask, z[1], 1);
w[3] = __shfl_sync(mask, z[1], 3);

// thread 3
w[0] = __shfl_sync(mask, z[2], 1);
w[1] = __shfl_sync(mask, z[2], 3);
w[2] = __shfl_sync(mask, z[3], 1);
w[3] = __shfl_sync(mask, z[3], 3);

Thread 0 for a vector length of 4 starts out with (in z):

0    1    2    3

and should end up with (in w):

0    100    200    300

The value of 0, of course, comes from thread 0. The value of 100 comes from thread 1, the value of 200 comes from thread 2, and the value of 300 comes from thread 3. Therefore, if the transpose is to have any chance of success, thread 0 had better retrieve a value from each of threads 1, 2, and 3 (at least; and note that a value cannot cycle from thread A to B to C in your implementation, because the source is always z and the destination is always w for each shuffle op).

Therefore, looking at your unrolled code, I’ve added comments (below, for thread 0) to explain what I mean. The last argument to the __shfl_sync op is the source lane. Since you are sourcing values from lanes (threads) 0 and 2 only, there is no way that thread 0 could retrieve a value from lane (thread) 1 or 3, but we’ve already established that this is necessary for the transpose to work.
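
// thread 0
w[0] = __shfl_sync(mask, z[0], 0);   // source lane 0
w[1] = __shfl_sync(mask, z[0], 2);   // source lane 2
w[2] = __shfl_sync(mask, z[1], 0);   // source lane 0
w[3] = __shfl_sync(mask, z[1], 2);   // source lane 2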

Your source lane pattern could not possibly be correct.

This is a single iteration; you need log(N) iterations to get the final transposed matrix. Sorry for the confusion.

The reason I limited the discussion to the first iteration is that you can already see the problem there. Thread 0 contains the correct data (for the first iteration), but thread 1 contains a copy of it:

   0  400    1  401    2  402    3  403
   0  400    1  401    2  402    3  403

The correct values should be:

   0  400    1  401    2  402    3  403
   4  404    5  405    6  406    7  407

OK.

The reason your first two threads are retrieving the same values is that those are the values being published from the source lane. A common misconception is that the destination lane somehow gets to “select” a value from an array published in the source lane. It does not. On any given __shfl_sync op, the destination lane picks up whatever value is published by the source lane for that instruction.
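
As a minimal standalone sketch of that point (not part of your code; the shfl_demo kernel and its values are just for illustration, and it can be dropped into the same .cu file): every lane publishes one value per __shfl_sync instruction, and every lane receives whatever its chosen source lane published, regardless of what it published itself.

__global__ void shfl_demo() {
    unsigned mask = __activemask();
    int published = threadIdx.x * 10;                 // the value this lane publishes
    int received  = __shfl_sync(mask, published, 3);  // every lane reads from lane 3
    // Every lane prints "received 30", because lane 3 published 30; the value a
    // lane passed as its own 2nd argument has no effect on what it reads back.
    printf("lane %d published %3d received %3d\n", threadIdx.x, published, received);
}
// launch with e.g. shfl_demo<<<1, 8>>>(); followed by cudaDeviceSynchronize();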

Let’s go through your code with this in mind. Let’s choose lane 1 as an example, because you seem to think it should differ from the lane 0 results.

for (int v = 0; v < C2; v++) {
    w[v * 2 + 0] = __shfl_sync(mask, z[v+inc], t1);  // line 1
    w[v * 2 + 1] = __shfl_sync(mask, z[v+inc], t2);  // line 2
}

For thread 0 and for thread 1, let’s work with a vector length of 4; therefore C2 = 2, t1 = 0, and t2 = 2. For thread 0, inc = 0, and for thread 1, inc = 2.

On the first pass of the loop, at line 1, thread 0 picks up whatever value is published by thread 0, and thread 1 picks up that same value. Thread 0 is publishing its own z[0] at that point (the 2nd argument), and both thread 0 and thread 1 have 0 as the source lane, so they both pick up thread 0’s z[0]. If you work through the next line and the next loop iteration, you will see that thread 0 and thread 1 have the same source lane pattern (whether 0 or 2) at every step, and therefore each picks up whatever value that source lane is publishing. They therefore pick up the same values, and the ending results are the same (between threads 0 and 1).
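
Spelling that out for a vector length of 4, using your initialization z[n] = n + threadIdx.x * 100:

// v = 0, line 1: each lane publishes its z[0+inc]; threads 0 and 1 both read lane t1 = 0
//   lane 0 publishes z[0] = 0 (lane 1 publishes z[2] = 102, but that only goes to
//   whichever lanes chose lane 1 as the source, i.e. lanes 2 and 3)
//   thread 0: w[0] = 0       thread 1: w[0] = 0
// v = 0, line 2: threads 0 and 1 both read lane t2 = 2, which publishes z[0] = 200
//   thread 0: w[1] = 200     thread 1: w[1] = 200
// v = 1, line 1: both read lane 0, which publishes z[1] = 1
//   thread 0: w[2] = 1       thread 1: w[2] = 1
// v = 1, line 2: both read lane 2, which publishes z[1] = 201
//   thread 0: w[3] = 201     thread 1: w[3] = 201
// Both threads finish with w = { 0, 200, 1, 201 }.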

Looking at thread 1’s first unrolled shuffle (w[0] = __shfl_sync(mask, z[2], 0);), it’s easy to get confused and think that thread 1 is “asking” for z[2] from thread 0. That is not what is happening. Thread 1 is publishing its z[2]. At the same time, thread 1 is retrieving whatever value is being published by thread 0, and that value happens to be thread 0’s z[0] (the same as what thread 0 gets at that point).

Here’s another way to think about it:
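
Take the same inner loop, but read each __shfl_sync as one warp-wide instruction:

for (int v = 0; v < C2; v++) {
    // One instruction executed by every lane: each lane offers its own z[v+inc]
    // to the warp, and reads back whatever was offered by the lane it names as
    // the source (t1).  Threads 0 and 1 name the same source lane, so they read
    // back the same value; what each of them offered has no bearing on what
    // they receive.
    w[v * 2 + 0] = __shfl_sync(mask, z[v+inc], t1);
    // Same story here, with source lane t2.
    w[v * 2 + 1] = __shfl_sync(mask, z[v+inc], t2);
}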

Hopefully with that commentary you can see that thread 0 and thread 1 will end up with the same values in the w array.

Oh, I see now, thanks. I guess I’ll just stick to the xor way for now!