Matrix Transpose on Titan X

I have found some strange behavior in the Matrix Transpose sample provided with the CUDA 7.5 Toolkit.

  1. I ran it on an NVIDIA GeForce Titan X card, and the results below show that the simple copy version of the transpose is the fastest.
  2. I also got different results when running from the command line and from Nsight.

Can anyone explain these two issues?

Running from Command Line:

Transpose Starting…

GPU Device 0: “GeForce GTX TITAN X” with compute capability 5.2

Device 0: “GeForce GTX TITAN X”
SM Capability 5.2 detected:
[GeForce GTX TITAN X] has 24 MP(s) x 128 (Cores/MP) = 3072 (Cores)
Compute performance scaling factor = 1.00

Matrix size: 512x512 (32x32 tiles), tile size: 16x16, block size: 16x16

transpose simple copy , Throughput = 293.9895 GB/s, Time = 0.00664 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose shared memory copy, Throughput = 252.2844 GB/s, Time = 0.00774 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose naive , Throughput = 66.5785 GB/s, Time = 0.02934 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coalesced , Throughput = 228.1688 GB/s, Time = 0.00856 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose optimized , Throughput = 250.7298 GB/s, Time = 0.00779 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coarse-grained , Throughput = 261.0125 GB/s, Time = 0.00748 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose fine-grained , Throughput = 244.3850 GB/s, Time = 0.00799 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose diagonal , Throughput = 190.7706 GB/s, Time = 0.01024 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
Test passed

Running from Nsight:

Transpose Starting…

GPU Device 0: “GeForce GTX TITAN X” with compute capability 5.2

Device 0: “GeForce GTX TITAN X”
SM Capability 5.2 detected:
[GeForce GTX TITAN X] has 24 MP(s) x 128 (Cores/MP) = 3072 (Cores)
Compute performance scaling factor = 1.00

Matrix size: 512x512 (32x32 tiles), tile size: 16x16, block size: 16x16

transpose simple copy , Throughput = 30.1667 GB/s, Time = 0.06474 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose shared memory copy, Throughput = 24.4021 GB/s, Time = 0.08004 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose naive , Throughput = 21.9152 GB/s, Time = 0.08912 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coalesced , Throughput = 16.6139 GB/s, Time = 0.11756 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose optimized , Throughput = 6.4687 GB/s, Time = 0.30194 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coarse-grained , Throughput = 6.3345 GB/s, Time = 0.30833 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose fine-grained , Throughput = 8.0173 GB/s, Time = 0.24361 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose diagonal , Throughput = 6.4711 GB/s, Time = 0.30182 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
Test passed

Regarding the difference between the command line and Nsight, my guess is that from the command line you are running the release build, and from Nsight you are running the debug build. A debug build compiles the device code without optimization, which will have a significant impact on performance.
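As a rough arithmetic check on the numbers posted above, here is a small sketch in Python. It assumes the sample counts one read and one write per element and reports "GB/s" in units of 2^30 bytes; that is an assumption, not something the logs state, but it reproduces the logged throughput to within rounding of the time column. The script also prints how much each kernel slows down under Nsight. The slowdown is uneven (roughly 3x for naive, roughly 39x for optimized), so it does not look like a fixed per-launch overhead, though that alone does not prove the debug-build guess.

```python
# Kernel times in milliseconds, copied from the two logs in the question.
cmd_ms = {
    "simple copy": 0.00664,
    "shared memory copy": 0.00774,
    "naive": 0.02934,
    "coalesced": 0.00856,
    "optimized": 0.00779,
    "coarse-grained": 0.00748,
    "fine-grained": 0.00799,
    "diagonal": 0.01024,
}
nsight_ms = {
    "simple copy": 0.06474,
    "shared memory copy": 0.08004,
    "naive": 0.08912,
    "coalesced": 0.11756,
    "optimized": 0.30194,
    "coarse-grained": 0.30833,
    "fine-grained": 0.24361,
    "diagonal": 0.30182,
}

ELEMENTS = 512 * 512            # "Size = 262144 fp32 elements"
BYTES_MOVED = 2 * ELEMENTS * 4  # assumption: each float is read once and written once


def throughput(time_ms):
    """Bytes moved per second in units of 2**30 bytes (assumed meaning of the log's GB/s)."""
    return BYTES_MOVED / (time_ms * 1e-3) / 2**30


def slowdown(name):
    """How many times longer the kernel took under Nsight than from the command line."""
    return nsight_ms[name] / cmd_ms[name]


for name in cmd_ms:
    print(f"{name:20s} cmd {throughput(cmd_ms[name]):7.1f} GB/s   "
          f"Nsight {throughput(nsight_ms[name]):6.1f} GB/s   "
          f"slowdown x{slowdown(name):.1f}")
```

The recomputed values differ slightly from the log because the logged times are rounded to five decimal places.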

Also, when I run the transpose sample code from CUDA 7.5, I get a 1024x1024 transpose by default, not 512x512. Have you modified the code, or are you running with command-line parameters that you haven’t shown?