Transpose 2D matrix with warp shuffle and in-place array

Robert_Crovella · December 18, 2020, 9:49pm

Yes:

#include <cstdio>
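__global__ void t(){

  int u[8];  // each lane holds one row of the 8x8 matrix
  for (int i = 0; i < 8; i++) u[i] = threadIdx.x*8+i;
  for (int i = 0; i < 8; i++) printf("lane: %d, idx: %d, val: %d\n", threadIdx.x, i, u[i]);
  // transpose: lane x trades element (x, x^i) for element (x^i, x) held by lane x^i
  for (int i = 1; i < 8; i++){
    int idx = threadIdx.x^i;  // partner lane, and also the column we overwrite
    // the partner computes idx == our lane id, so it supplies its u[our lane id]
    u[idx] = __shfl_sync(0x000000FF, u[idx], idx);}  // mask 0xFF: lanes 0-7 participate
  for (int i = 0; i < 8; i++) printf("lane: %d, idx: %d, tra: %d\n", threadIdx.x, i, u[i]);
}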

int main(){

  t<<<1,8>>>();
  cudaDeviceSynchronize();
}

You could use __shfl_xor_sync() here also, but I’m not convinced it makes the code any simpler.
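For comparison, here is a rough sketch of what the __shfl_xor_sync() variant might look like. The only change to the transpose loop is that the shuffle takes the xor distance i rather than the explicit source lane. The out buffer and the check_t_xor() host helper are hypothetical additions of mine, there only so the result can be checked on the host instead of read from printf output.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same 8x8 register transpose, written with __shfl_xor_sync().
// `out` is a hypothetical 64-int device buffer used only to inspect the result.
__global__ void t_xor(int *out){
  int u[8];
  for (int i = 0; i < 8; i++) u[i] = threadIdx.x*8+i;   // lane r holds row r
  for (int i = 1; i < 8; i++){
    int idx = threadIdx.x^i;                             // partner lane and column to overwrite
    // the shuffle reads from lane (laneid ^ i); that lane passes its u[our lane id]
    u[idx] = __shfl_xor_sync(0x000000FF, u[idx], i);}
  for (int i = 0; i < 8; i++) out[threadIdx.x*8+i] = u[i];
}

// Hypothetical host helper: returns the number of elements that are NOT transposed.
int check_t_xor(){
  int *d = nullptr, h[64] = {0}, bad = 0;
  cudaMalloc(&d, sizeof(h));
  t_xor<<<1,8>>>(d);
  cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);  // also waits for the kernel
  cudaFree(d);
  for (int r = 0; r < 8; r++)
    for (int c = 0; c < 8; c++)
      if (h[r*8+c] != c*8+r) bad++;                      // lane r should now hold column r
  return bad;
}
```

Calling check_t_xor() from main() should return 0 if the transpose worked. The index idx is still needed to pick the array element, which is why the xor form does not really save anything.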

You’re swapping elements across the main diagonal, basically. On iteration i, lane x trades its element u[x^i] for lane x^i’s element u[x], and as i runs from 1 to 7 the xor pattern pairs each lane with every other lane exactly once, so every off-diagonal element gets swapped. The main diagonal is untouched, which is why we only need 7 loop iterations instead of 8. I think it should be straightforward to extend this to a 32x32 transpose across the warp, or to any power-of-2 square dimension up to 32.
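A possible sketch of that extension, assuming the size is passed as a template parameter N (my name, not something from the code above) and that the block is launched with exactly N threads. The member mask is built from the low N lanes. The xor pairing only stays inside lanes 0..N-1 when N is a power of 2, hence the static_assert. The out buffer and check_t_n() are again hypothetical helpers for checking the result.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// NxN register transpose across the low N lanes of a warp.
// Assumption: N is a power of 2, 2 <= N <= 32, and the block has exactly N threads.
template <int N>
__global__ void t_n(int *out){
  static_assert(N >= 2 && N <= 32 && (N & (N-1)) == 0, "N must be a power of 2 in [2,32]");
  const unsigned mask = 0xFFFFFFFFu >> (32 - N);        // low N lanes participate
  int u[N];
  for (int i = 0; i < N; i++) u[i] = threadIdx.x*N+i;   // lane r holds row r
  for (int i = 1; i < N; i++){                          // N-1 steps, diagonal untouched
    int idx = threadIdx.x^i;                            // stays below N because N is a power of 2
    u[idx] = __shfl_sync(mask, u[idx], idx);}
  for (int i = 0; i < N; i++) out[threadIdx.x*N+i] = u[i];
}

// Hypothetical host helper: returns the number of elements that are NOT transposed.
template <int N>
int check_t_n(){
  int *d = nullptr, h[N*N] = {0}, bad = 0;
  cudaMalloc(&d, sizeof(h));
  t_n<N><<<1,N>>>(d);
  cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);  // also waits for the kernel
  cudaFree(d);
  for (int r = 0; r < N; r++)
    for (int c = 0; c < N; c++)
      if (h[r*N+c] != c*N+r) bad++;                     // lane r should now hold column r
  return bad;
}
```

For example, check_t_n<32>() exercises the full-warp case and should return 0. One caveat: because u is indexed with a value that depends on threadIdx.x, the compiler may place the array in local memory rather than registers, so this is a correctness sketch and not a tuned implementation.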

I presume you have seen this (based on your code). The non-square case indexing is more involved, and offhand I don’t know if it can be easily done without temp arrays.

There is additional discussion of warp-shuffle transpose in this thread.