Hi, at our company we have started an effort to do massive MIMO simulations (up to 32 antenna ports) with CUDA. It turned out that starting from 8x8 matrices, the register consumption for complex-valued matrices went through the roof (and most of it spilled into local memory). So I’ve had to redesign our matrix classes to distribute the matrix columns across threads.

My matrix class is templated (with the matrix dimensions being template parameters). For loop unrolling over template parameters I use the unrolling trick found here (using C++11 lambda expressions, available since CUDA 7.0, for better code readability): http://www.codeproject.com/Articles/75423/Loop-Unrolling-over-Template-Arguments
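For reference, the idiom can be sketched roughly like this (my names, not the article's; on the device, `run()` would additionally be marked `__device__`):

```cpp
// Rough sketch of the lambda-based unrolling idiom: a recursive
// template calls the body functor once per compile-time index, so the
// compiler can fully unroll the loop.
template <int N>
struct Unroll {
    template <typename F>
    static void run(F f) {
        Unroll<N - 1>::run(f);  // indices 0 .. N-2 first
        f(N - 1);               // then index N-1
    }
};

template <>
struct Unroll<0> {
    template <typename F>
    static void run(F) {}  // recursion anchor: nothing to do
};
```

With this, `Unroll<4>::run([&](int i) { acc += row[i]; });` expands to four straight-line statements with constant indices, which is what allows a `row[]` array to stay in registers.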

So I’ve come up with the following data layout, shown below for a 4x4 matrix type.

Here 4 consecutive threads together hold one matrix, one column per thread. I managed to implement addition, multiplication, and inversion (using LU decomposition) just fine. I make heavy use of warp shuffles at various stages.

```
tid:    0   1   2   3   4   5   6   7   ...  31
row[0]  a11 a12 a13 a14 b11 b12 b13 b14
row[1]  a21 a22 a23 a24 b21 b22 b23 b24
row[2]  a31 a32 a33 a34 b31 b32 b33 b34
row[3]  a41 a42 a43 a44 b41 b42 b43 b44
```
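For completeness, a stripped-down sketch of what that layout amounts to in code (hypothetical names, simplified to plain C++ with `std::complex`; the real class uses `__device__` methods and a CUDA complex type):

```cpp
#include <complex>

using Complex = std::complex<float>;

// Each thread holds ONE column of the matrix in registers, so an
// NxN matrix costs N complex values per thread instead of N*N.
template <int Rows, int Cols>
struct DistMatrix {
    // row[r] on lane t holds element (r, t % Cols); Cols consecutive
    // lanes together own one full matrix.
    Complex row[Rows];

    // A purely column-local operation needs no inter-thread
    // communication at all:
    void scale(float s) {
        for (int r = 0; r < Rows; ++r) row[r] *= s;
    }
};
```

Operations that mix columns (multiplication, LU pivoting, …) are where the warp shuffles come in.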

What I cannot currently wrap my head around is matrix transposition. So far I have only managed to implement it with shared memory. Again, the shared memory consumption becomes far too high starting with 16x16 and 32x32 matrices, to the point where shared memory becomes infeasible if one wants to use reasonably big block sizes (>= 256 threads).
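To make the cost concrete, here is a stripped-down host-side simulation of the shared-memory route (a stand-in sketch, not my actual kernel): each group of N lanes stages its matrix into a tile and reads it back with the indices swapped. On the device, `tile` would be a `__shared__` array and the phase boundary a `__syncthreads()`.

```cpp
// Host simulation of the shared-memory transpose: lane t holds column
// t % N of its group's matrix in row[t][0..N-1].
template <int N, int Threads>
void transposeViaTile(float row[Threads][N]) {
    static float tile[Threads / N][N][N];  // stand-in for shared memory

    // Phase 1: every lane writes its column into the tile.
    for (int t = 0; t < Threads; ++t)
        for (int r = 0; r < N; ++r)
            tile[t / N][r][t % N] = row[t][r];

    // (__syncthreads() here on the device)

    // Phase 2: every lane reads back a column of the transpose.
    for (int t = 0; t < Threads; ++t)
        for (int r = 0; r < N; ++r)
            row[t][r] = tile[t / N][t % N][r];
}
```

The problem is the tile size: with 256 threads per block and 32x32 single-precision complex matrices, that is (256/32) * 32 * 32 * 8 bytes = 64 KB of shared memory per block, which already exceeds the 48 KB available.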

Would anyone have an idea how to transpose these matrices using warp shuffles? I keep running into brick walls trying to implement this with shuffles. The matrices are supposed to stay in registers the whole time; there will be no loads/stores to global memory. Hence the tricks outlined in the pixel.io blog post (http://www.pixel.io/blog/2013/3/25/fast-matrix-transposition-on-kepler-without-using-shared-mem.html) won’t work for me, as they involve global memory transactions.

Christian