Is there a performance problem with CUDA on Windows?

Hi,

I am using Visual Studio 2012 and CUDA 8.0 on a Windows 10 PC.
The graphics card is a GTX 1060.
When I tested a matrix transpose example in the CUDA samples directory,
I found that the performance is somewhat weird.
The output is as follows:


GPU Device 0: "GeForce GTX 1060" with compute capability 6.1

Device 0: "GeForce GTX 1060"
SM Capability 6.1 detected:
[GeForce GTX 1060] has 10 MP(s) x 128 (Cores/MP) = 1280 (Cores)
Compute performance scaling factor = 1.00

Matrix size: 1024x1024 (64x64 tiles), tile size: 16x16, block size: 16x16

transpose simple copy , Throughput = 16.6109 GB/s, Time = 0.47032 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose shared memory copy, Throughput = 16.9271 GB/s, Time = 0.46154 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose naive , Throughput = 20.0964 GB/s, Time = 0.38875 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coalesced , Throughput = 12.4843 GB/s, Time = 0.62579 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose optimized , Throughput = 4.9152 GB/s, Time = 1.58944 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coarse-grained , Throughput = 5.4218 GB/s, Time = 1.44093 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose fine-grained , Throughput = 6.5340 GB/s, Time = 1.19566 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose diagonal , Throughput = 4.9574 GB/s, Time = 1.57594 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
Test passed


As far as I know, “transpose naive” should show the worst throughput, but it shows the best in this test.
Does anyone know the reason and how to fix it?
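
For anyone reading the numbers above: the Throughput column appears to follow directly from the Time column. Here is a small sketch of that arithmetic. It assumes each fp32 element is read once and written once, and that "GB" means 2^30 bytes; I have not checked this against the sample source, but it reproduces the printed figures. The function name `throughput_gbps` is mine, not the sample's.

```cpp
#include <cassert>
#include <cstddef>

// Effective bandwidth for a transpose of num_elements floats that took time_ms.
// Assumptions (not confirmed from the sample source): every element is read
// once and written once, and "GB" means 2^30 bytes.
double throughput_gbps(std::size_t num_elements, double time_ms) {
    assert(time_ms > 0.0);
    const double bytes_moved = 2.0 * static_cast<double>(num_elements) * sizeof(float);
    const double gigabytes = bytes_moved / (1024.0 * 1024.0 * 1024.0);
    return gigabytes / (time_ms * 1e-3);  // convert ms to s
}
```

Calling `throughput_gbps(1048576, 0.47032)` gives roughly 16.61, in line with the "simple copy" row. Since the data size is fixed, the ranking of the kernels is decided entirely by the measured time.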

Are you running the debug build?

Run the release build. Never evaluate GPU performance based on a debug build.
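
Some background on why this matters: in Visual Studio, the Debug configuration of a CUDA project typically passes `-G` to nvcc, which generates device debug information and turns off device code optimizations, so kernel timings no longer reflect how the code is meant to run. As a rough sketch, a benchmark can guard against this by checking predefined macros at compile time. `__CUDACC_DEBUG__` is defined by nvcc under `-G`, and `_DEBUG` is defined by MSVC when a debug runtime is selected; the helper names below are hypothetical and not part of the CUDA samples.

```cpp
#include <cstdio>

// Returns true when the file was compiled in a debug configuration:
//   __CUDACC_DEBUG__ : nvcc device-debug mode (-G), device optimizations off
//   _DEBUG           : MSVC debug runtime (/MDd, /MTd), i.e. a Debug configuration
bool is_debug_build() {
#if defined(__CUDACC_DEBUG__) || defined(_DEBUG)
    return true;
#else
    return false;
#endif
}

// Human-readable label for the log, so saved benchmark output records
// which kind of build produced it.
const char* build_label() {
    return is_debug_build() ? "debug (timings not representative)" : "non-debug";
}

// Call once at program start, before any timing is reported.
void warn_if_debug_build() {
    if (is_debug_build()) {
        std::fprintf(stderr, "warning: %s build, do not use these timings\n",
                     build_label());
    }
}
```

Note that this only detects the debug case; a build with neither macro defined is not guaranteed to be fully optimized, so it is still worth confirming that the Release configuration is selected.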

Thank you very much.
I got the expected results with the release build.