An Efficient Matrix Transpose in CUDA C/C++

A great way to remove shared memory bank conflicts.

Hi Mark, I have a question: what if the input dimensions are not a power of 2? Do you pad them to a power of 2? Thanks.

The shared memory tile is padded to TILE_DIM+1 columns, which skews each row across the 32 banks and prevents bank conflicts.
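A minimal sketch of how that padding is applied, roughly along the lines of the article's conflict-free kernel (not the exact listing); TILE_DIM = 32 and BLOCK_ROWS = 8 as in the article:

#define TILE_DIM   32
#define BLOCK_ROWS 8

__global__ void transposeNoBankConflicts(float *odata, const float *idata)
{
  // +1 column: consecutive rows of the same tile column start in different
  // banks, so reading tile[threadIdx.x][...] down a column is conflict-free.
  __shared__ float tile[TILE_DIM][TILE_DIM + 1];

  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int width = gridDim.x * TILE_DIM;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)            // coalesced read
    tile[threadIdx.y + j][threadIdx.x] = idata[(y + j) * width + x];

  __syncthreads();

  x = blockIdx.y * TILE_DIM + threadIdx.x;                   // swap block offsets
  y = blockIdx.x * TILE_DIM + threadIdx.y;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)            // coalesced write
    odata[(y + j) * width + x] = tile[threadIdx.x][threadIdx.y + j];
}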

Took some time to understand, but quite fascinated.
Thanks for the nice tip.

Would you be interested in benchmarking 2 x GTX 960s on 2 nodes, with a custom PCIe driver and everything supplied? It needs a 3D array raw in-line, for 100% in-memory offset-pointer meta-access to a 192 GB database. You may be up to it. The way I see it, one GPU on each Xeon Sandy Bridge PCIe bus in the pair per server controls the absolute array position on a distributed terabyte-memory InfiniBand ConnectX-3 cluster. Now that compute capability 5 is garden variety, 32 x 32 blocks sync easily within any one node. I've been looking at the 2690v2 atomic instructions and can't see the Boolean op I need; a simple compare-and-replace is only good for a bit match, and we need a range comparison that is efficient to do on the GPU. Manual unrolling looks interesting. Great article, well structured, which is not often found. Mark would be with NVIDIA? Are you in? Let me know, Rus

Can you please explain this?

Dear sir:
Can we use one-dimensional shared memory in this case, so that there is no problem of bank conflicts?

The author clearly mentions:

For simplicity of presentation, we’ll consider only square matrices whose dimensions are integral multiples of 32 on a side

What happened to the for loops?

for (index_t j = 0; j < TILE_DIM; j += BLOCK_ROWS){}

Wow, it has been ages since I last worked on anything CUDA. I think that for-loop is obsolete when you have 32x32 blocks working on 32x32 data tiles, because each thread handles exactly one matrix position. The block scheduler then makes sure that all matrix positions are processed.

My code above works with non-square matrices, though.
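The commenter's code isn't shown, but a minimal sketch of such a one-element-per-thread variant, assuming 32x32 thread blocks (the kernel name and bounds checks are illustrative, not from the article), might look like this:

#define TILE_DIM 32

// With 32x32 blocks the j-loop over BLOCK_ROWS collapses to one iteration,
// so each thread simply moves a single element. idata is width x height
// (row-major); odata receives the height x width transpose.
__global__ void transposeOneElementPerThread(float *odata, const float *idata,
                                             int width, int height)
{
  __shared__ float tile[TILE_DIM][TILE_DIM + 1];   // padded against bank conflicts

  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;

  if (x < width && y < height)
    tile[threadIdx.y][threadIdx.x] = idata[y * width + x];

  __syncthreads();

  // Swap block offsets for the transposed, still-coalesced write.
  x = blockIdx.y * TILE_DIM + threadIdx.x;
  y = blockIdx.x * TILE_DIM + threadIdx.y;

  if (x < height && y < width)
    odata[y * height + x] = tile[threadIdx.x][threadIdx.y];
}

The bounds checks are what let it handle non-square sizes that are not multiples of 32; launch it with dimBlock = (32, 32) and dimGrid rounded up to cover the matrix.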

How do I use dynamic shared memory instead of static in this?
Can someone help?
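One common approach is to declare the tile as extern shared memory and pass its size as the third kernel-launch parameter; a minimal sketch under the article's TILE_DIM/BLOCK_ROWS settings (the kernel name is illustrative), with the +1 padding applied by hand in the row stride:

#define TILE_DIM   32
#define BLOCK_ROWS 8

__global__ void transposeDynamicShared(float *odata, const float *idata)
{
  extern __shared__ float tile[];        // size supplied at launch time
  const int stride = TILE_DIM + 1;       // padded row stride, as in the static version

  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int width = gridDim.x * TILE_DIM;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    tile[(threadIdx.y + j) * stride + threadIdx.x] = idata[(y + j) * width + x];

  __syncthreads();

  x = blockIdx.y * TILE_DIM + threadIdx.x;
  y = blockIdx.x * TILE_DIM + threadIdx.y;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    odata[(y + j) * width + x] = tile[threadIdx.x * stride + threadIdx.y + j];
}

// Launch with the shared-memory size as the third configuration argument:
//   size_t shmem = TILE_DIM * (TILE_DIM + 1) * sizeof(float);
//   transposeDynamicShared<<<dimGrid, dimBlock, shmem>>>(d_out, d_in);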

I tried this sample and observed that the “shared memory copy” kernel is ~18% faster than the direct copy. A GV100 and a TRX6000 both behave this way. Any thoughts on what explains this?

Device : Quadro GV100
Matrix size: 1024 1024, Block size: 32 8, Tile size: 32 32
dimGrid: 32 32 1. dimBlock: 32 8 1
                  Routine         Bandwidth (GB/s)
                     copy              496.48
       shared memory copy              590.63
          naive transpose              142.74
      coalesced transpose              362.32
  conflict-free transpose              590.63