Hi,
I am using Visual Studio 2012 and CUDA 8.0 on a Windows 10 PC.
The graphics card is a GeForce GTX 1060.
When I ran the matrix transpose example from the CUDA samples directory,
I found that the performance results look strange.
The output is as follows:
GPU Device 0: “GeForce GTX 1060” with compute capability 6.1
Device 0: “GeForce GTX 1060”
SM Capability 6.1 detected:
[GeForce GTX 1060] has 10 MP(s) x 128 (Cores/MP) = 1280 (Cores)
Compute performance scaling factor = 1.00
Matrix size: 1024x1024 (64x64 tiles), tile size: 16x16, block size: 16x16
transpose simple copy , Throughput = 16.6109 GB/s, Time = 0.47032 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose shared memory copy, Throughput = 16.9271 GB/s, Time = 0.46154 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose naive , Throughput = 20.0964 GB/s, Time = 0.38875 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coalesced , Throughput = 12.4843 GB/s, Time = 0.62579 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose optimized , Throughput = 4.9152 GB/s, Time = 1.58944 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coarse-grained , Throughput = 5.4218 GB/s, Time = 1.44093 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose fine-grained , Throughput = 6.5340 GB/s, Time = 1.19566 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose diagonal , Throughput = 4.9574 GB/s, Time = 1.57594 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
Test passed
As far as I know, “transpose naive” should show the worst throughput, but it shows the best in this test.
Does anyone know the reason and how to fix it?
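
For reference, here is a small sanity check of the numbers above. It is only a sketch, and it assumes that the sample counts one read plus one write of every 4-byte element and divides by 1024^3 bytes. I inferred this from the printed figures, not from the sample source. The function name `throughput_gib_per_s` is my own. Under these assumptions the reported throughput values match the reported times, so the odd ordering seems to come from the measured kernel times, not from the throughput arithmetic.

```python
# Recompute the reported throughput from the reported time.
# Assumption (inferred from the output, not confirmed in the sample source):
# throughput = 2 * size_in_bytes / 1024^3 / time_in_seconds

BYTES_PER_FLOAT = 4  # the output says "fp32 elements"


def throughput_gib_per_s(num_elements, time_ms):
    """Return throughput in units of 1024^3 bytes per second."""
    bytes_moved = 2 * num_elements * BYTES_PER_FLOAT  # one read + one write
    return bytes_moved / (1024 ** 3) / (time_ms / 1000.0)


if __name__ == "__main__":
    size = 1048576  # 1024 x 1024 matrix
    # (kernel name, reported time in ms), copied from the output above
    reported = [
        ("simple copy", 0.47032),
        ("naive", 0.38875),
        ("coalesced", 0.62579),
        ("optimized", 1.58944),
    ]
    for name, time_ms in reported:
        print(f"{name:12s} {throughput_gib_per_s(size, time_ms):8.4f}")
```

Each printed value can be compared against the matching "Throughput" column in the output.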