Transpose bitwise x:y matrix (x!=y)

geohei · October 2, 2021, 10:52am

I need to transpose x[] to y[] inside a kernel.
The element type and the dimensions of the two arrays differ (but the total number of bits is identical).

uint16_t x[32] = { 0 };
uint32_t y[16] = { 0 };
// Bit j of y[i] is bit (15 - i) of x[j].
for (int i = 0; i < 16; i++) {
	for (int j = 0; j < 32; j++) {
		y[i] |= (uint32_t) ((x[j] >> (15 - i)) & 0x1) << j;
	}
}

x:  16         0
00  0xxx....xxxx
    1xxx....xxxx
    ............
32  xxxxxxxxxxxx

y:  32         0
00  xxxx....3210
    xxxx....xxxx
    ............
16  xxxxxxxxxxxx

My code above is not very efficient. There are solutions like this here, but that’s for a square matrix held in a single array.

Are there any specific GPU PTX/SASS instructions which I can use to optimize my matrix transpose? It doesn’t need to be a C solution. Assembler is fine.

rs277 · October 2, 2021, 6:03pm

Treat x and y as each being made up of two 16 x 16 bit matrices, then you can apply any of the conventional square transpose algorithms.

The book “Hacker’s Delight” by Henry Warren has a number of solutions, including the swapping method outlined in your link.
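As a rough illustration of the split suggested above, here is a minimal C sketch. It is not code from this thread or from the book: the names transpose16 and transpose_32x16 are made up for the example, and the masked XOR swap is the generic block-swap scheme, not the book's tuned variant. It assumes the bit layout of the loop in the original post (bit j of y[i] is bit 15 - i of x[j]), which is why the rows are read back in reverse order when merging.

```c
#include <stdint.h>

/* Transpose a 16x16 bit matrix in place.
   Convention (an assumption of this sketch): bit c of a[r] is element (r, c). */
static void transpose16(uint16_t a[16])
{
    /* Masks selecting the columns whose index has bit j clear, for j = 8, 4, 2, 1. */
    static const uint16_t mask[4] = { 0x00FF, 0x0F0F, 0x3333, 0x5555 };
    int s = 0;
    for (int j = 8; j != 0; j >>= 1, s++) {
        uint16_t m = mask[s];
        for (int k = 0; k < 16; k++) {
            if (k & j)
                continue;                       /* k is the upper row of each pair */
            /* Swap elements (k, c + j) and (k + j, c) for every c selected by m. */
            uint16_t t = (uint16_t)(((a[k] >> j) ^ a[k + j]) & m);
            a[k + j] ^= t;
            a[k] ^= (uint16_t)(t << j);
        }
    }
}

/* Hypothetical wrapper: split x into two 16x16 blocks, transpose each,
   then merge. Produces the same y as the double loop in the original post. */
static void transpose_32x16(const uint16_t x[32], uint32_t y[16])
{
    uint16_t lo[16], hi[16];
    for (int k = 0; k < 16; k++) {
        lo[k] = x[k];
        hi[k] = x[k + 16];
    }
    transpose16(lo);
    transpose16(hi);
    for (int i = 0; i < 16; i++)
        y[i] = (uint32_t)lo[15 - i] | ((uint32_t)hi[15 - i] << 16);
}
```

Each 16x16 transpose needs 4 stages of 8 swaps, compared with 512 shift/mask/or steps in the naive double loop.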

Later:

Not really. CUDA is somewhat light on byte-oriented instructions, let alone bit-level ones. That said, if you are able to read the above-mentioned book, you’ll find that CUDA’s LOP3 instruction soaks up a decent number of conventional two-operand logic instructions, and that the SHF instruction, accessed via __funnelshift_r or __funnelshift_l, produces a decent result for 32 x 32 bit transposes, as outlined.

geohei · October 3, 2021, 4:38pm

Many thanks for the hint about Hacker’s Delight. It looks like what I’m looking for. Unfortunately I didn’t find any online excerpts from the book (2nd edition). I’ll keep looking …

Regarding the matrix issue in the OP: I did what you suggested. Even with the split into two 16:16 matrices, two transpose calls, and the merge back into one matrix, the code is 15% faster than my code in the OP. I didn’t expect such a speed increase. Many thanks!

This raises the question: why was my code in the OP so slow?

rs277 · October 3, 2021, 7:12pm

I may have given the wrong impression that the book offers CUDA solutions, which it doesn’t. What I meant was that the C implementations it outlines, when compiled to SASS, result in much reduced instruction counts, thanks to LOP3.

The book takes the swapping method one step further in efficiency by simplifying the swap function (which halves the instruction count of the swap) and adding rotate instructions, hence the reference to SHF, which can be used for 32-bit rotates if both input operands are the same.
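To make the remark about SHF and rotates concrete, here is a small portable C model of the left funnel shift. The function funnelshift_l_model is a hypothetical stand-in written for this example; it follows the documented behaviour of CUDA's __funnelshift_l (concatenate hi:lo, shift left by shift & 31, return the upper 32 bits), but it is not the intrinsic itself.

```c
#include <stdint.h>

/* Host-side model of a left funnel shift (assumed to match the documented
   semantics of __funnelshift_l): take the 64-bit value hi:lo, shift it left
   by (shift & 31), and return the most significant 32 bits. */
static uint32_t funnelshift_l_model(uint32_t lo, uint32_t hi, uint32_t shift)
{
    uint64_t v = ((uint64_t)hi << 32) | lo;
    return (uint32_t)((v << (shift & 31)) >> 32);
}

/* Feeding the same value into both operands turns the funnel shift into
   a 32-bit rotate left. */
static uint32_t rotl32_via_funnel(uint32_t x, uint32_t n)
{
    return funnelshift_l_model(x, x, n);
}
```

In device code the model would presumably be replaced by __funnelshift_l(x, x, n).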

Explaining and showing the code for the above is probably a bit much for the forum, but if you’re doing much “bit twiddling” or similar optimization work, that book is well worth obtaining if possible.

njuffa · October 3, 2021, 8:14pm

Before adding intrinsics for funnel shifts to the code, I would check whether the CUDA compiler recognizes common rotate source code idioms, because many compilers these days do, and the CUDA compiler is based on LLVM, which is compiler infrastructure also used by other toolchains.
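For reference, a common form of the rotate idiom in plain C looks like the sketch below (rotl32 and rotr32 are illustrative names). Whether a particular nvcc version turns it into a single SHF is not something the sketch guarantees; that has to be checked in the generated SASS, for example with cuobjdump -sass.

```c
#include <stdint.h>

/* Rotate left by n bits. Masking both shift counts keeps the expression
   well defined for n == 0 (no shift by 32 ever occurs). */
static inline uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

/* Rotate right by n bits, written the same way. */
static inline uint32_t rotr32(uint32_t x, unsigned n)
{
    n &= 31;
    return (x >> n) | (x << ((32 - n) & 31));
}
```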

Anybody who regularly deals with bit-twiddling in their programming ought to acquire a copy of Hacker’s Delight. You would want to get the second (and final; the author passed away in 2018) edition from 2013.

The need to transpose a matrix may be indicative of an XY-problem. There is rarely a need to transpose matrices explicitly; instead, the work can usually be incorporated into whatever operations involve the matrix later. That’s how the transposition modes for matrices in BLAS work, for example.
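As a toy illustration of folding the transposition into the consumer, assuming the layout from the original post (bit j of y[i] is bit 15 - i of x[j]), later code could read the transposed bits straight out of x without ever building y. The accessor name y_bit is made up for this sketch, and whether this pays off depends entirely on how y is used afterwards.

```c
#include <stdint.h>

/* Return bit j of the never-materialized y[i], read directly from x.
   Valid for 0 <= i < 16 and 0 <= j < 32. */
static inline uint32_t y_bit(const uint16_t x[32], int i, int j)
{
    return (uint32_t)(x[j] >> (15 - i)) & 1u;
}
```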

rs277 · October 3, 2021, 8:29pm

To which I’d add they should then visit:

to obtain the errata for the book. In fact, the “rotate” transpose method mentioned above, which is only partially described in the book and comes without sample code, has an error in its description in the 2nd edition.

In my case, transposition is being used in some bitsliced work - perhaps an exception.

njuffa · October 3, 2021, 8:36pm

A rare case, obviously :-)

As always, these are just heuristics developed over decades of programming. Whenever someone asks “What is the fastest way to bulk copy data?” or “What is the fastest way to invert/transpose a matrix?” that is a red flag to me that prompts the counter question “Why would you want to do that?”

That does not mean that it is never a good idea to bulk copy data or invert/transpose matrices.

geohei · October 4, 2021, 7:46am

Hacker’s Delight is ordered.
I’ll check my code again to see if/how I can omit the matrix transpose.
Thanks so far!
