Transpose bitwise x:y matrix (x!=y)

I need to transpose x[] to y[] inside a kernel.
typedef and dimension of both arrays differ (but product is identical).

uint16_t x[32] = { 0 };
uint32_t y[16] = { 0 };
for (i = 0; i < 16; i++) {
	for (int j = 0; j < 32; j++) {
			y[i] |= (uint32_t) ((x[j] >> (15 - i)) & 0x1) << j;

x:  16         0
00  0xxx....xxxx
32  xxxxxxxxxxxx
y:  32         0
00  xxxx....3210
16  xxxxxxxxxxxx

My code above is not very efficient. There are solutions like this here, but that’s square matrix single array.

Are there are specific GPU PTX/SASS commands which I can use to optimize my matrix? It doesn’t need to be a C solution. Assembler is fine.

Treat x and y as each being made up of two 16 x 16 bit matrices, then you can apply any of the conventional square transpose algorithms.

The book “Hackers Delight” by Henry Warren has a number of solutions, including the swapping method outlined in your link.


Not really. Cuda is somewhat light on byte oriented instructions, let alone dealing at the bit level. That said, if you are able to read the above mentioned book, you’ll find that Cuda’s LOP3 instruction soaks up a decent quantity of conventional 2 operand logic instructions and the SHF instruction, accessed via __funnelshift_r or __funnelshift_l produces a decent result for 32 x 32 bit transposes, as outlined.

1 Like

Many thanks for the hint about Hackers Delight. Looks like what I’m looking for. Unfortunately I didn’t find any online excerpts from the book (2nd edition). I’ll keep on looking …

Regarding the matrix issue in OP - I did what you suggested. Even with the split in two 16:16 matrices, two transpose calls, and the merge back to one matrix, made the code 15% faster than my code in OP. I didn’t expect such a speed increase. Many thanks!

This brings up the question, why my code in OP was so slow?

I may have given a wrong impression, that the book offers Cuda solutions, which it doesn’t. What I meant was that the C code implementations outlined, when compiled to SASS, result in much reduced instruction counts, due to LOP3.

The book takes the swapping method one step further in efficiency, by simplifying the swap function, (which halves the instruction count of the swap), and adding rotate instructions - hence the reference to SHF, which can be used for 32 bit rotates if both input operands are the same.

Explaining and showing the code for the above is probably a bit much for the forum, but if you’re doing much “bit twiddling” or similar optimization work, that book is well worth obtaining if possible.

1 Like

Before adding intrinsics for funnel shifts to the code, I would check whether the CUDA compiler recognizes common rotate source code idioms, because many compilers these days do, and the CUDA compiler is based on LLVM which is compiler infrastructure also used by other toolchains.

Anybody who regularly deals with bit-twiddling in their programming ought to acquire a copy of Hacker’s Delight. You would want to get the second (and final; the author passed away in 2018) edition from 2013.

The need to transpose a matrix may be indicative of an XY-problem. There is rarely a need to transpose matrices explicitly, instead the work can usually be incorporated into whatever operations involve the matrix later. That’s how the transposition modes for matrices in BLAS work, for example.

1 Like

To which I’d add they should then visit:

to obtain the errata for the book. In fact the “rotate” transpose method mentioned above, which is only partially described in the book and no sample code, actually does have an error in the description, for the 2nd Edition.

In my case, transposition is being used in some bitsliced work - perhaps an exception.

1 Like

A rare case, obviously :-)

As always, these are just heuristics developed over decades of programming. Whenever someone asks “What is the fastest way to bulk copy data?” or “What is the fastest way to invert/transpose a matrix?” that is a red flag to me that prompts the counter question “Why would you want to do that?”

That does not mean that it is never a good idea to bulk copy data or invert/transpose matrices.


Hackers Delight is ordered.
I’ll check my code again to see if/how I can omit the matrix transpose.
Thanks so far!

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.