Transposing register-held matrices with warp shuffles? Need help.

Hi, at our company we have started an effort do massive MIMO simulations (up to 32 antenna ports) with CUDA. It turned out that starting from 8x8 matrices, the register consumption for complex valued matrices went through the roof (and most of it spilled into local memory). So I’ve had to redesign our matrix classes to distribute the matrix columns across threads.

My matrix class is templated (with matrix dimensions being template parameters). For loop unrolling over template parameters I use the unrolling trick found here - using C++11 lambda expressions in CUDA 7.0 for better code readability:

So I’ve come up with the following data layout, shown exemplary for a 4x4 matrix type.
Here 4 consecutive threads contain one matrix. I managed to implement addition, multiplication, inverse (using LU decomposition) just fine. I make heavy use of warp shuffles at various stages.

         0   1   2   3    4   5   6   7   ... 31
row[0]   a11 a12 a13 a14  b11 b12 b13 b14
row[1]   a21 a22 a23 a24  b21 b22 b23 b24
row[2]   a31 a32 a33 a34  b31 b32 b33 b34
row[3]   a41 a42 a43 a44  b41 b42 b43 b44

What I cannot currently wrap my head around is matrix transposition. So far I have only managed to implement this with shared memory. Again, my shared memory consumption becomes way too high starting with 16x16 and 32x32 matrices - to the point where using shared memory becomes infeasible when one wants to use reasonably big block sizes (>=256 threads).

Would anyone have an idea how to transpose these matrices using warp shuffles? I keep running into brick walls trying to implement this with shuffles. The matrices are supposed to stay in registers all the time, there will be no loads/stores to global memory. Hence the tricks outlined in the blog ( won’t work for me, as they involve global memory transactions.


I have successfully implemented something similar before. Besides shuffles, the generated code is going to use “select” instructions and predicated register assignments.

For me the key thing to realize was the following. To transpose a power-of-two sized matrix, decompose the problem into:

  1. transposing a 2x2 block matrix
A B  ->  A C
C D      B D
  1. transposing each of 4 sub-blocks recursively, in parallel.

So e.g. to transpose a 32x32 matrix held in registers, I had 5 stages shuffle-exchanging matrices of sizes 16x16…1x1.

Does that help?

The recursive block based approach certainly is an interesting one. I will think about it.

EDIT: hmm, the recursive approach might create a few problems with non-square and non-power-of-2 sized matrices.

Your opening post only mentioned square power-of-two matrices. I’d say the recursive approach is not applicable as-is for other kinds (or do you already know how to extend it?)

What actual matrix sizes do you want to handle?

As the classes are not yet fully developed, I only tested with square matrices at the moment. My template parameters are <H,W> giving the height (no. of rows) and width (no. of columns) of the matrix.

I think I will have to introduce a third template parameter S (stride) giving the thread stride distance between matrices.

In order to multiply e.g. 2x4 and 4x2 matrices (Aa, Bb, …), the second operand would need to be stored with a stride of 4, leaving 2 threads unused in the 2nd operand. This is to ensure alignment of threads between operands.

0   1   2   3    4   5   6   7    ...

first operand
A11 A12 A13 A14  B11 B12 B13 B14  ...
A21 A22 A23 A24  B21 B22 B23 B24  ...

second operand at stride 4 (--- means thread does not hold valid data)
a11 a12 --- ---  b11 b12 --- ---  ...
a21 a22 --- ---  b21 b22 --- ---  ...
a31 a32 --- ---  b31 b32 --- ---  ...
a41 a42 --- ---  b41 b42 --- ---  ...

Due to limitations of the warp shuffle’s “width” argument, I think the stride always has to be a power of 2.

I still want to support arbitrary matrix sizes up to 32 if possible (not just powers of 2).


Not sure if you’ve found this already, but there’s a library for fairly generic shuffle-based exchanges here:

The README says it’s primarily for SoA/AoS conversions, but might be applicable for your case as well.

I haven’t tried it myself, so I can’t comment what kind of code it produces.