I have found strange behavior in the Matrix Transpose sample provided with the CUDA 7.5 Toolkit.
1) I ran it on an NVIDIA GeForce GTX TITAN X card and got the results below, which show that the simple copy version is the fastest.
2) I also got different results when running it from the command line and from Nsight.
Can anyone explain these two issues?
Running from Command Line:
Transpose Starting…
GPU Device 0: “GeForce GTX TITAN X” with compute capability 5.2
Device 0: “GeForce GTX TITAN X”
SM Capability 5.2 detected:
[GeForce GTX TITAN X] has 24 MP(s) x 128 (Cores/MP) = 3072 (Cores)
Compute performance scaling factor = 1.00
Matrix size: 512x512 (32x32 tiles), tile size: 16x16, block size: 16x16
transpose simple copy , Throughput = 293.9895 GB/s, Time = 0.00664 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose shared memory copy, Throughput = 252.2844 GB/s, Time = 0.00774 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose naive , Throughput = 66.5785 GB/s, Time = 0.02934 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coalesced , Throughput = 228.1688 GB/s, Time = 0.00856 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose optimized , Throughput = 250.7298 GB/s, Time = 0.00779 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coarse-grained , Throughput = 261.0125 GB/s, Time = 0.00748 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose fine-grained , Throughput = 244.3850 GB/s, Time = 0.00799 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose diagonal , Throughput = 190.7706 GB/s, Time = 0.01024 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
Test passed
Running from Nsight:
Transpose Starting…
GPU Device 0: “GeForce GTX TITAN X” with compute capability 5.2
Device 0: “GeForce GTX TITAN X”
SM Capability 5.2 detected:
[GeForce GTX TITAN X] has 24 MP(s) x 128 (Cores/MP) = 3072 (Cores)
Compute performance scaling factor = 1.00
Matrix size: 512x512 (32x32 tiles), tile size: 16x16, block size: 16x16
transpose simple copy , Throughput = 30.1667 GB/s, Time = 0.06474 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose shared memory copy, Throughput = 24.4021 GB/s, Time = 0.08004 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose naive , Throughput = 21.9152 GB/s, Time = 0.08912 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coalesced , Throughput = 16.6139 GB/s, Time = 0.11756 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose optimized , Throughput = 6.4687 GB/s, Time = 0.30194 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coarse-grained , Throughput = 6.3345 GB/s, Time = 0.30833 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose fine-grained , Throughput = 8.0173 GB/s, Time = 0.24361 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose diagonal , Throughput = 6.4711 GB/s, Time = 0.30182 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
Test passed
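As a sanity check on the two logs above, the throughput column looks like it is derived purely from the measured time: one read plus one write of every fp32 element, divided by the kernel time and expressed in units of 1024^3 bytes. This formula is an assumption inferred from the printed numbers, not taken from the sample's source, and the helper name `throughput_gib_per_s` is made up for illustration. A minimal sketch that recomputes a few of the logged rows:

```python
# Assumption: throughput = (bytes read + bytes written) / kernel time,
# reported in GiB/s (1024**3 bytes). Inferred from the logs, not from the sample source.
ELEMENTS = 262144        # "Size = 262144 fp32 elements" in the log
BYTES_PER_ELEMENT = 4    # fp32
GIB = 1024 ** 3


def throughput_gib_per_s(time_ms, elements=ELEMENTS):
    """Hypothetical helper: recompute throughput from a logged kernel time in ms."""
    bytes_moved = 2 * elements * BYTES_PER_ELEMENT  # one read + one write per element
    return bytes_moved / GIB / (time_ms * 1e-3)


# (run, kernel, logged time in ms, logged throughput), copied from the output above
samples = [
    ("command line", "simple copy", 0.00664, 293.9895),
    ("command line", "naive", 0.02934, 66.5785),
    ("Nsight", "simple copy", 0.06474, 30.1667),
    ("Nsight", "optimized", 0.30194, 6.4687),
]

for run, kernel, time_ms, logged in samples:
    computed = throughput_gib_per_s(time_ms)
    print(f"{run:12s} {kernel:12s} computed {computed:9.4f}  logged {logged:9.4f}")
```

If this reading is right, the computed values agree with the logged ones to within the rounding of the printed times, so the gap between the two runs is entirely a gap in measured kernel time (for example 0.00664 ms versus 0.06474 ms for simple copy), not in the amount of data moved.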