Matrix transpose slower using shared memory

I'm running the transpose example from the CUDA samples in the 6.5 Toolkit (I've also run the sample code from the ParallelForAll blog post 'Efficient Matrix Transpose') on my laptop, which has a GT 620M (Fermi) card. The 'optimized' implementations that use shared memory show lower throughput than the 'naive' implementation and do not come close to the 'simple copy' throughput. Below are my outputs:

Device 0: “GeForce GT 620M”
SM Capability 2.1 detected:
[GeForce GT 620M] has 2 MP(s) x 48 (Cores/MP) = 96 (Cores)
Compute performance scaling factor = 2.00

Matrix size: 512x512 (16x16 tiles), tile size: 32x32, block size: 32x8

transpose simple copy       , Throughput = 4.8768 GB/s, Time = 0.40049 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose shared memory copy, Throughput = 2.2575 GB/s, Time = 0.86518 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose naive             , Throughput = 2.2181 GB/s, Time = 0.88053 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coalesced         , Throughput = 2.6276 GB/s, Time = 0.74331 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose optimized         , Throughput = 3.1283 GB/s, Time = 0.62434 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coarse-grained    , Throughput = 3.1307 GB/s, Time = 0.62385 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose fine-grained      , Throughput = 3.3956 GB/s, Time = 0.57519 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose diagonal          , Throughput = 2.9128 GB/s, Time = 0.67053 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
Test passed

I've also run these examples on another (older) card, a GTX 550 Ti, and see similar results. I know these are older cards, but I am wondering why the optimized versions that use shared memory are not faster.

Has this been reported in another post? I could not find any such entry.

What gives? Why don't the shared memory implementations match the simple copy throughput?
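For reference, the two transpose kernels being compared look roughly like this (paraphrased from the blog post, so the exact sample code may differ slightly; TILE_DIM=32 and BLOCK_ROWS=8 correspond to the configuration in the output above, and the matrix is assumed square with dimensions that are a multiple of TILE_DIM):

#define TILE_DIM   32
#define BLOCK_ROWS  8

// Naive transpose: the loads from idata are coalesced, but the stores to
// odata are strided by the matrix width, so they are not coalesced.
__global__ void transposeNaive(float *odata, const float *idata)
{
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[x * width + (y + j)] = idata[(y + j) * width + x];
}

// Coalesced transpose: stage a tile in shared memory so that both the
// global load and the global store are coalesced; the transposition is
// done on the shared memory indices.
__global__ void transposeCoalesced(float *odata, const float *idata)
{
    __shared__ float tile[TILE_DIM][TILE_DIM];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        tile[threadIdx.y + j][threadIdx.x] = idata[(y + j) * width + x];

    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;   // transposed block offset
    y = blockIdx.x * TILE_DIM + threadIdx.y;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[(y + j) * width + x] = tile[threadIdx.x][threadIdx.y + j];
}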

Thanks

I've run those transpose codes on "older" Fermi devices, and the transposeCoalesced kernel should definitely run faster than the naive transpose; it does so for me.

Thanks, txbob.

My outputs above are for TILE_DIM=32, BLOCK_ROWS=8, matching the configuration used in the ParallelForAll blog post. Below are results for the configuration used in the original CUDA sample, namely TILE_DIM=16, BLOCK_ROWS=16. As above, the naive implementation that uses only global memory has higher throughput than the coalesced one that goes through shared memory, and the shared memory implementations do not come close to the copy throughput. I also notice that the shared memory copy throughput is lower than the simple copy throughput; my understanding is that these should be roughly the same (a sketch of the shared memory copy kernel appears after the output below). Does this imply that the shared memory implementations are slower because of the overhead of staging data through shared memory and the required synchronization barrier __syncthreads()?

Transpose Starting…

GPU Device 0: “GeForce GT 620M” with compute capability 2.1

Device 0: “GeForce GT 620M”
SM Capability 2.1 detected:
[GeForce GT 620M] has 2 MP(s) x 48 (Cores/MP) = 96 (Cores)
Compute performance scaling factor = 2.00

Matrix size: 512x512 (32x32 tiles), tile size: 16x16, block size: 16x16

transpose simple copy       , Throughput = 8.0010 GB/s, Time = 0.24411 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose shared memory copy, Throughput = 3.9424 GB/s, Time = 0.49542 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose naive             , Throughput = 7.1266 GB/s, Time = 0.27406 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coalesced         , Throughput = 4.5509 GB/s, Time = 0.42917 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose optimized         , Throughput = 4.5130 GB/s, Time = 0.43278 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coarse-grained    , Throughput = 4.5142 GB/s, Time = 0.43267 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose fine-grained      , Throughput = 5.2559 GB/s, Time = 0.37161 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose diagonal          , Throughput = 3.4687 GB/s, Time = 0.56307 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
Test passed
Press any key to continue . . .
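For reference, the shared memory copy kernel that I am comparing against the simple copy looks roughly like this (again paraphrased from the blog post; TILE_DIM and BLOCK_ROWS as defined earlier). The simple copy is identical except that it writes straight from idata to odata, with no shared memory tile and no __syncthreads():

// Copy through shared memory: same global memory access pattern as the
// simple copy, but the data is staged in a shared tile and the block has
// to synchronize before writing it back out.
__global__ void copySharedMem(float *odata, const float *idata)
{
    __shared__ float tile[TILE_DIM * TILE_DIM];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        tile[(threadIdx.y + j) * TILE_DIM + threadIdx.x] = idata[(y + j) * width + x];

    __syncthreads();

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[(y + j) * width + x] = tile[(threadIdx.y + j) * TILE_DIM + threadIdx.x];
}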

I have not looked at the source code, but if it is parameterized, the parameters might not have been tweaked sufficiently for very low-end GPUs. That is just a guess; if it really matters, you can find out what is going on with the help of the CUDA profiler (an example invocation follows the data below). Below is data from my machine with a GPU similar to yours, also using CUDA 6.5. The transpose through shared memory is just a tad slower than the simple copy, and significantly faster than the naive method, which is as desired and expected.

Transpose Starting...

GPU Device 0: "Quadro 2000" with compute capability 2.1

> Device 0: "Quadro 2000"
> SM Capability 2.1 detected:
> [Quadro 2000] has 4 MP(s) x 48 (Cores/MP) = 192 (Cores)
> Compute performance scaling factor = 1.00

Matrix size: 1024x1024 (64x64 tiles), tile size: 16x16, block size: 16x16

transpose simple copy       , Throughput = 27.0793 GB/s, Time = 0.28850 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose shared memory copy, Throughput = 25.3167 GB/s, Time = 0.30859 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose naive             , Throughput = 15.5574 GB/s, Time = 0.50217 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coalesced         , Throughput = 22.4956 GB/s, Time = 0.34729 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose optimized         , Throughput = 24.2513 GB/s, Time = 0.32215 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coarse-grained    , Throughput = 24.5523 GB/s, Time = 0.31820 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose fine-grained      , Throughput = 25.0806 GB/s, Time = 0.31150 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose diagonal          , Throughput = 21.3932 GB/s, Time = 0.36519 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
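If you do want to look with the profiler, an invocation along these lines (assuming the sample binary is named transpose; the available metric names vary by GPU architecture and toolkit version) will show whether each kernel's global loads and stores are actually coalesced:

nvprof --metrics gld_efficiency,gst_efficiency ./transpose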

Thx, njuffa. I was looking at the times under the 'debug' configuration instead of 'release'. After switching to the release build and rerunning, I indeed see better times:

Transpose Starting...

GPU Device 0: "GeForce GTX 550 Ti" with compute capability 2.1

> Device 0: "GeForce GTX 550 Ti"
> SM Capability 2.1 detected:
> [GeForce GTX 550 Ti] has 4 MP(s) x 48 (Cores/MP) = 192 (Cores)
> Compute performance scaling factor = 1.00

Matrix size: 1024x1024 (64x64 tiles), tile size: 16x16, block size: 16x16

transpose simple copy       , Throughput = 47.4704 GB/s, Time = 0.16458 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose shared memory copy, Throughput = 43.4320 GB/s, Time = 0.17988 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose naive             , Throughput = 24.1361 GB/s, Time = 0.32369 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coalesced         , Throughput = 36.4295 GB/s, Time = 0.21446 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose optimized         , Throughput = 40.7325 GB/s, Time = 0.19180 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coarse-grained    , Throughput = 41.6359 GB/s, Time = 0.18764 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose fine-grained      , Throughput = 42.0056 GB/s, Time = 0.18599 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose diagonal          , Throughput = 34.1558 GB/s, Time = 0.22873 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
Test passed

I also reran the code from the blog post and see better times there as well:

Device : GeForce GTX 550 Ti
Matrix size: 1024 1024, Block size: 32 8, Tile size: 32 32
dimGrid: 32 32 1. dimBlock: 32 8 1
                  Routine         Bandwidth (GB/s)
                     copy               74.02
       shared memory copy               81.38
          naive transpose               17.48
      coalesced transpose               25.40
  conflict-free transpose               81.42
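For reference, the conflict-free transpose in that last table is the coalesced kernel with the shared tile padded by one column (sketch below, following the blog post); the padding keeps the column-wise reads from the tile from all hitting the same shared memory bank:

// Same as transposeCoalesced, except for the extra column of padding,
// which removes the shared memory bank conflicts on the transposed read.
__global__ void transposeNoBankConflicts(float *odata, const float *idata)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];   // note the +1

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        tile[threadIdx.y + j][threadIdx.x] = idata[(y + j) * width + x];

    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[(y + j) * width + x] = tile[threadIdx.x][threadIdx.y + j];
}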

Thanks for your help!

You would never want to use a debug build to measure performance.
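For anyone comparing numbers later: the debug configuration of the sample projects typically compiles device code with nvcc's -G flag, which disables most device-side optimizations, so kernel timings from a debug build are not meaningful. A minimal sketch of the two kinds of builds from the command line (file name and architecture flag are just for illustration):

# debug build: device optimizations disabled, only for debugging
nvcc -G -g -arch=sm_21 -o transpose_debug transpose.cu

# release build: use this for any performance measurement
nvcc -O3 -arch=sm_21 -o transpose transpose.cu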