Is there a performance problem with CUDA on Windows?

hjkang · March 22, 2017, 2:18am

Hi,

I am using Visual Studio 2012 and CUDA 8.0 on a Windows 10 PC.
The graphics card is a GTX 1060.
When I ran the matrix transpose example from the CUDA samples directory,
the performance numbers looked odd.
The output is as follows:

GPU Device 0: “GeForce GTX 1060” with compute capability 6.1

Device 0: “GeForce GTX 1060”
SM Capability 6.1 detected:
[GeForce GTX 1060] has 10 MP(s) x 128 (Cores/MP) = 1280 (Cores)
Compute performance scaling factor = 1.00

Matrix size: 1024x1024 (64x64 tiles), tile size: 16x16, block size: 16x16

transpose simple copy , Throughput = 16.6109 GB/s, Time = 0.47032 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose shared memory copy, Throughput = 16.9271 GB/s, Time = 0.46154 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose naive , Throughput = 20.0964 GB/s, Time = 0.38875 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coalesced , Throughput = 12.4843 GB/s, Time = 0.62579 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose optimized , Throughput = 4.9152 GB/s, Time = 1.58944 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coarse-grained , Throughput = 5.4218 GB/s, Time = 1.44093 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose fine-grained , Throughput = 6.5340 GB/s, Time = 1.19566 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose diagonal , Throughput = 4.9574 GB/s, Time = 1.57594 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
Test passed

As far as I know, “transpose naive” should show the worst throughput, but in this test it shows the best.
Does anyone know the reason and how to fix it?
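
For anyone who wants to sanity-check the output above: the Throughput column can be reproduced from the Time column. The sketch below assumes the sample counts one read and one write of each 4-byte fp32 element and reports "GB/s" in units of 2^30 bytes. That assumption is inferred from the printed numbers, not taken from the sample's source, and the function name `throughput_gib_per_s` is made up for this illustration.

```python
def throughput_gib_per_s(num_elements, time_ms, bytes_per_element=4):
    """Estimate throughput, assuming each element is read once and written once."""
    bytes_moved = 2 * num_elements * bytes_per_element  # one read + one write
    seconds = time_ms * 1e-3
    return bytes_moved / seconds / 1024**3  # assumption: "GB" means 2**30 bytes


# Values copied from the output above (1024x1024 fp32 matrix).
SIZE = 1048576
timings_ms = {
    "naive": 0.38875,
    "coalesced": 0.62579,
    "optimized": 1.58944,
}

for name, t in timings_ms.items():
    print(f"transpose {name}: {throughput_gib_per_s(SIZE, t):.4f} GB/s")
```

Under those assumptions the computed values agree with the printed ones to about three decimal places, so the throughput figures are just the kernel times restated. The odd ranking therefore comes from the measured times themselves, not from how the throughput is calculated.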

Robert_Crovella · March 22, 2017, 2:56am

Are you running the debug build?

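Run the release build. Never evaluate GPU performance based on a debug build: debug builds compile device code with optimizations disabled (nvcc's -G flag), so the timings do not reflect how the kernels actually perform.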

hjkang · March 22, 2017, 4:00am

Thank you very much.
I got the expected results with the release build.
