An Efficient Matrix Transpose in CUDA C/C++

A great way to remove shared memory bank conflicts.

Hi Mark, I have a question: what if the input dimensions are not a power of 2? Do you pad them to a power of 2? Thanks.

The shared memory tile is padded to TILE_DIM+1 columns, which skews each row across the 32 banks and prevents bank conflicts.
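A minimal sketch of how that padding is applied, roughly along the lines of the article's conflict-free kernel (not the exact listing); TILE_DIM = 32 and BLOCK_ROWS = 8 as in the article:

#define TILE_DIM   32
#define BLOCK_ROWS 8

__global__ void transposeNoBankConflicts(float *odata, const float *idata)
{
  // +1 column: consecutive rows of the same tile column start in different
  // banks, so reading tile[threadIdx.x][...] down a column is conflict-free.
  __shared__ float tile[TILE_DIM][TILE_DIM + 1];

  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int width = gridDim.x * TILE_DIM;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)            // coalesced read
    tile[threadIdx.y + j][threadIdx.x] = idata[(y + j) * width + x];

  __syncthreads();

  x = blockIdx.y * TILE_DIM + threadIdx.x;                   // swap block offsets
  y = blockIdx.x * TILE_DIM + threadIdx.y;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)            // coalesced write
    odata[(y + j) * width + x] = tile[threadIdx.x][threadIdx.y + j];
}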

Took some time to understand, but quite fascinated.
Thanks for the nice tip.

Would you be interested in benchmarking 2 x GTX 960s on 2 nodes, with a custom PCIe driver and everything supplied? It needs a 3D array raw in-line, for 100% in-memory offset-pointer meta-access to a 192 GB database. You may be up to it. The way I see it, one GPU on each Xeon Sandy Bridge PCIe bus in the pair per server controls the absolute array position on a distributed terabyte-memory InfiniBand ConnectX-3 cluster. Now that compute capability 5 is garden variety, 32 x 32 blocks sync easily within any one node. I've been looking at the 2690v2 atomic instructions and can't see the Boolean op I need; a simple compare-and-replace is only good for a bit match, and we need a range comparison that is efficient to do on the GPU. Manual unrolling looks interesting. Great article, well structured, which is not often found. Mark would be with NVIDIA? Are you in? Let me know, Rus

Can you please explain this?

Dear sir:
Can we use one-dimensional shared memory in this case, so that there is no problem of bank conflicts?

The author clearly mentions:

For simplicity of presentation, we’ll consider only square matrices whose dimensions are integral multiples of 32 on a side

What happened to the for loops?

for (index_t j = 0; j < TILE_DIM; j += BLOCK_ROWS){}

Wow, it has been ages since I last worked on anything CUDA. I think that for-loop is obsolete when you have 32x32 blocks working on 32x32 data tiles, because each thread handles exactly one matrix position. The block scheduler then makes sure that all matrix positions are processed.

My code above works with non-square matrices, though.
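The commenter's code isn't shown, but a minimal sketch of such a one-element-per-thread variant, assuming 32x32 thread blocks (the kernel name and bounds checks are illustrative, not from the article), might look like this:

#define TILE_DIM 32

// With 32x32 blocks the j-loop over BLOCK_ROWS collapses to one iteration,
// so each thread simply moves a single element. idata is width x height
// (row-major); odata receives the height x width transpose.
__global__ void transposeOneElementPerThread(float *odata, const float *idata,
                                             int width, int height)
{
  __shared__ float tile[TILE_DIM][TILE_DIM + 1];   // padded against bank conflicts

  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;

  if (x < width && y < height)
    tile[threadIdx.y][threadIdx.x] = idata[y * width + x];

  __syncthreads();

  // Swap block offsets for the transposed, still-coalesced write.
  x = blockIdx.y * TILE_DIM + threadIdx.x;
  y = blockIdx.x * TILE_DIM + threadIdx.y;

  if (x < height && y < width)
    odata[y * height + x] = tile[threadIdx.x][threadIdx.y];
}

The bounds checks are what let it handle non-square sizes that are not multiples of 32; launch it with dimBlock = (32, 32) and dimGrid rounded up to cover the matrix.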

How do I use dynamic shared memory instead of static in this?
Can someone help?
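One common approach is to declare the tile as extern shared memory and pass its size as the third kernel-launch parameter; a minimal sketch under the article's TILE_DIM/BLOCK_ROWS settings (the kernel name is illustrative), with the +1 padding applied by hand in the row stride:

#define TILE_DIM   32
#define BLOCK_ROWS 8

__global__ void transposeDynamicShared(float *odata, const float *idata)
{
  extern __shared__ float tile[];        // size supplied at launch time
  const int stride = TILE_DIM + 1;       // padded row stride, as in the static version

  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int width = gridDim.x * TILE_DIM;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    tile[(threadIdx.y + j) * stride + threadIdx.x] = idata[(y + j) * width + x];

  __syncthreads();

  x = blockIdx.y * TILE_DIM + threadIdx.x;
  y = blockIdx.x * TILE_DIM + threadIdx.y;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    odata[(y + j) * width + x] = tile[threadIdx.x * stride + threadIdx.y + j];
}

// Launch with the shared-memory size as the third configuration argument:
//   size_t shmem = TILE_DIM * (TILE_DIM + 1) * sizeof(float);
//   transposeDynamicShared<<<dimGrid, dimBlock, shmem>>>(d_out, d_in);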

I tried this sample and observed that the “shared memory copy” kernel is ~18% faster than the direct copy. A GV100 and a TRX6000 both behave this way. Any thoughts on what explains this?

Device : Quadro GV100
Matrix size: 1024 1024, Block size: 32 8, Tile size: 32 32
dimGrid: 32 32 1. dimBlock: 32 8 1
                  Routine         Bandwidth (GB/s)
                     copy              496.48
       shared memory copy              590.63
          naive transpose              142.74
      coalesced transpose              362.32
  conflict-free transpose              590.63