Transposing register-held matrices with warp shuffles? Need help.

cbuchner1 · July 8, 2015, 8:50am

Hi, at our company we have started an effort do massive MIMO simulations (up to 32 antenna ports) with CUDA. It turned out that starting from 8x8 matrices, the register consumption for complex valued matrices went through the roof (and most of it spilled into local memory). So I’ve had to redesign our matrix classes to distribute the matrix columns across threads.

My matrix class is templated (with matrix dimensions being template parameters). For loop unrolling over template parameters I use the unrolling trick found here - using C++11 lambda expressions in CUDA 7.0 for better code readability: Loop Unrolling over Template Arguments - CodeProject

So I’ve come up with the following data layout, shown exemplary for a 4x4 matrix type.
Here 4 consecutive threads contain one matrix. I managed to implement addition, multiplication, inverse (using LU decomposition) just fine. I make heavy use of warp shuffles at various stages.

.........tid
         0   1   2   3    4   5   6   7   ... 31
row[0]   a11 a12 a13 a14  b11 b12 b13 b14
row[1]   a21 a22 a23 a24  b21 b22 b23 b24
row[2]   a31 a32 a33 a34  b31 b32 b33 b34
row[3]   a41 a42 a43 a44  b41 b42 b43 b44

What I cannot currently wrap my head around is matrix transposition. So far I have only managed to implement this with shared memory. Again, my shared memory consumption becomes way too high starting with 16x16 and 32x32 matrices - to the point where using shared memory becomes infeasible when one wants to use reasonably big block sizes (>=256 threads).

Would anyone have an idea how to transpose these matrices using warp shuffles? I keep running into brick walls trying to implement this with shuffles. The matrices are supposed to stay in registers all the time, there will be no loads/stores to global memory. Hence the tricks outlined in the pixel.io blog (http://www.pixel.io/blog/2013/3/25/fast-matrix-transposition-on-kepler-without-using-shared-mem.html) won’t work for me, as they involve global memory transactions.

Christian

monoid · July 8, 2015, 9:48am

I have successfully implemented something similar before. Besides shuffles, the generated code is going to use “select” instructions and predicated register assignments.

For me the key thing to realize was the following. To transpose a power-of-two sized matrix, decompose the problem into:

transposing a 2x2 block matrix

A B  ->  A C
C D      B D

transposing each of 4 sub-blocks recursively, in parallel.

So e.g. to transpose a 32x32 matrix held in registers, I had 5 stages shuffle-exchanging matrices of sizes 16x16…1x1.

Does that help?

cbuchner1 · July 8, 2015, 10:13am

The recursive block based approach certainly is an interesting one. I will think about it.

EDIT: hmm, the recursive approach might create a few problems with non-square and non-power-of-2 sized matrices.

monoid · July 8, 2015, 12:42pm

Your opening post only mentioned square power-of-two matrices. I’d say the recursive approach is not applicable as-is for other kinds (or do you already know how to extend it?)

What actual matrix sizes do you want to handle?

cbuchner1 · July 8, 2015, 1:16pm

As the classes are not yet fully developed, I only tested with square matrices at the moment. My template parameters are <H,W> giving the height (no. of rows) and width (no. of columns) of the matrix.

I think I will have to introduce a third template parameter S (stride) giving the thread stride distance between matrices.

In order to multiply e.g. 2x4 and 4x2 matrices (Aa, Bb, …), the second operand would need to be stored with a stride of 4, leaving 2 threads unused in the 2nd operand. This is to ensure alignment of threads between operands.

tid
0   1   2   3    4   5   6   7    ...

first operand
A11 A12 A13 A14  B11 B12 B13 B14  ...
A21 A22 A23 A24  B21 B22 B23 B24  ...

second operand at stride 4 (--- means thread does not hold valid data)
a11 a12 --- ---  b11 b12 --- ---  ...
a21 a22 --- ---  b21 b22 --- ---  ...
a31 a32 --- ---  b31 b32 --- ---  ...
a41 a42 --- ---  b41 b42 --- ---  ...

Due to limitations of the warp shuffle’s “width” argument, I think the stride always has to be a power of 2.

I still want to support arbitrary matrix sizes up to 32 if possible (not just powers of 2).

Christian

monoid · July 9, 2015, 9:36am

Not sure if you’ve found this already, but there’s a library for fairly generic shuffle-based exchanges here: GitHub - bryancatanzaro/trove: Full-speed Array of Structures access

The README says it’s primarily for SoA/AoS conversions, but might be applicable for your case as well.

I haven’t tried it myself, so I can’t comment what kind of code it produces.

dseifert · July 26, 2023, 12:52pm

I have implemented the idea from @monoid in a CCCL PR: Add WarpExchangeAlgorithm customization for WarpExchange class by pb-dseifert · Pull Request #256 · NVIDIA/cccl · GitHub

Shaquille · July 27, 2023, 2:34am

thank you, you are right

Topic		Replies	Views
Implement 2D matrix transpose using warp shuffle without local memory CUDA Programming and Performance cuda	21	4962	December 19, 2023
Transpose 2D matrix with warp shuffle and in-place array CUDA Programming and Performance	3	2043	October 12, 2021
CUDA Matrix Transpose only with warp shuffle instructions not working CUDA Programming and Performance	1	365	January 2, 2023
CUDA Shuffle Instruction (Warp-level intra register exchange) CUDA Programming and Performance	8	9166	November 29, 2013
Matix Multiplication using __shfl? CUDA Programming and Performance	8	3257	August 29, 2016
In-place register-based matrix transpose with shuffle not working CUDA Programming and Performance	7	1216	October 12, 2021
Warp shuffle instruction not working as expected CUDA Programming and Performance	7	936	September 6, 2023
CUDA matrix transpose using shared memory CUDA Programming and Performance cuda	0	520	July 6, 2020
How to do a in-wrap transpose mma C&D matrix? CUDA Programming and Performance cuda , kernel	9	118	November 12, 2025
Bitwise Transpose using Warp Shuffles CUDA Programming and Performance	1	108	January 27, 2025

Transposing register-held matrices with warp shuffles? Need help.

Related topics