Matrix Transpose on Titan X

DrAyazKhan · December 22, 2016, 12:21pm

I have found strange behavior in the Matrix Transpose sample provided with the CUDA 7.5 Toolkit.

I ran it on an NVIDIA GeForce GTX TITAN X card and noticed two things: 1) the results show that the simple copy version of the transpose is the fastest; 2) I got different results when running from the command line and from Nsight.

Can anyone explain these two issues?

Running from Command Line:

Transpose Starting…

GPU Device 0: “GeForce GTX TITAN X” with compute capability 5.2

Device 0: “GeForce GTX TITAN X”
SM Capability 5.2 detected:
[GeForce GTX TITAN X] has 24 MP(s) x 128 (Cores/MP) = 3072 (Cores)
Compute performance scaling factor = 1.00

Matrix size: 512x512 (32x32 tiles), tile size: 16x16, block size: 16x16

transpose simple copy , Throughput = 293.9895 GB/s, Time = 0.00664 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose shared memory copy, Throughput = 252.2844 GB/s, Time = 0.00774 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose naive , Throughput = 66.5785 GB/s, Time = 0.02934 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coalesced , Throughput = 228.1688 GB/s, Time = 0.00856 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose optimized , Throughput = 250.7298 GB/s, Time = 0.00779 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coarse-grained , Throughput = 261.0125 GB/s, Time = 0.00748 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose fine-grained , Throughput = 244.3850 GB/s, Time = 0.00799 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose diagonal , Throughput = 190.7706 GB/s, Time = 0.01024 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
Test passed

Running from Nsight:

Transpose Starting…

GPU Device 0: “GeForce GTX TITAN X” with compute capability 5.2

Device 0: “GeForce GTX TITAN X”
SM Capability 5.2 detected:
[GeForce GTX TITAN X] has 24 MP(s) x 128 (Cores/MP) = 3072 (Cores)
Compute performance scaling factor = 1.00

Matrix size: 512x512 (32x32 tiles), tile size: 16x16, block size: 16x16

transpose simple copy , Throughput = 30.1667 GB/s, Time = 0.06474 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose shared memory copy, Throughput = 24.4021 GB/s, Time = 0.08004 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose naive , Throughput = 21.9152 GB/s, Time = 0.08912 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coalesced , Throughput = 16.6139 GB/s, Time = 0.11756 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose optimized , Throughput = 6.4687 GB/s, Time = 0.30194 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coarse-grained , Throughput = 6.3345 GB/s, Time = 0.30833 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose fine-grained , Throughput = 8.0173 GB/s, Time = 0.24361 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose diagonal , Throughput = 6.4711 GB/s, Time = 0.30182 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
Test passed
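For readers comparing the two logs, the Throughput column can be related to the Time column with a little arithmetic. The sketch below assumes that each element is read once and written once, and that "GB" in the output means 1024^3 bytes. Both points are inferred from the printed numbers, not taken from the sample's source, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Effective bandwidth for a kernel that reads and writes every fp32
// element exactly once. ASSUMPTION: 1 GB = 1024^3 bytes and traffic is
// 2 * elements * sizeof(float); inferred from the log, not from the sample.
double throughput_gib_per_s(std::size_t elements, double time_ms) {
    assert(time_ms > 0.0);
    const double bytes_moved =
        2.0 * static_cast<double>(elements) * sizeof(float);
    const double gib_moved = bytes_moved / (1024.0 * 1024.0 * 1024.0);
    const double seconds = time_ms / 1000.0;
    return gib_moved / seconds;
}
```

Under these assumptions, `throughput_gib_per_s(262144, 0.00664)` gives about 294.1 and `throughput_gib_per_s(262144, 0.06474)` gives about 30.2, which agrees with the logged 293.9895 and 30.1667 GB/s to within the rounding of the printed times.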

Robert_Crovella · December 23, 2016, 5:36am

Regarding the difference between the command line and Nsight, my guess is that from the command line you are running the release build, and from Nsight you are running the debug build. This will have a significant impact on performance.

Also, when I run the transpose sample code from CUDA 7.5, I am getting a 1024x1024 default transpose, not 512x512. Have you made some modifications to the code, or are you running with some command line parameters that you haven’t shown?
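To put a number on the gap, the per-kernel times from the two logs in the original post can be divided directly. This is only a sketch: the struct and function names are hypothetical, the times are copied from the posted output, and the ratios do not by themselves prove which build configuration was used.

```cpp
#include <array>
#include <cassert>
#include <cstdio>

// One row per kernel; times in ms, copied from the two logs in the question.
struct KernelTiming {
    const char* name;
    double command_line_ms;
    double nsight_ms;
};

const std::array<KernelTiming, 8> kTimings = {{
    {"simple copy",        0.00664, 0.06474},
    {"shared memory copy", 0.00774, 0.08004},
    {"naive",              0.02934, 0.08912},
    {"coalesced",          0.00856, 0.11756},
    {"optimized",          0.00779, 0.30194},
    {"coarse-grained",     0.00748, 0.30833},
    {"fine-grained",       0.00799, 0.24361},
    {"diagonal",           0.01024, 0.30182},
}};

// How many times slower the Nsight run was than the command-line run.
double slowdown(const KernelTiming& t) {
    assert(t.command_line_ms > 0.0);
    return t.nsight_ms / t.command_line_ms;
}

// Print one line per kernel, e.g. "simple copy   9.8x".
void print_slowdowns() {
    for (const KernelTiming& t : kTimings) {
        std::printf("%-20s %6.1fx\n", t.name, slowdown(t));
    }
}
```

The ratio is far from uniform: roughly 3x for the naive kernel, about 10x for the simple copy, and around 41x for the coarse-grained kernel.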
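The size question can be checked against the posted output itself. The following sketch (hypothetical names, assuming square matrices of fp32 values and square tiles) computes the element count, byte footprint, and tiles per side for a given matrix dimension.

```cpp
#include <cassert>
#include <cstddef>

// Basic shape facts for a square fp32 matrix split into square tiles.
struct MatrixShape {
    std::size_t elements;        // side * side
    std::size_t bytes;           // elements * sizeof(float)
    std::size_t tiles_per_side;  // side / tile_dim
};

// ASSUMPTION: side is an exact multiple of tile_dim.
MatrixShape describe_square_matrix(std::size_t side, std::size_t tile_dim) {
    assert(tile_dim > 0 && side % tile_dim == 0);
    const std::size_t elements = side * side;
    return MatrixShape{elements, elements * sizeof(float), side / tile_dim};
}
```

For `describe_square_matrix(512, 16)` this gives 262144 elements and 32 tiles per side, matching the "Size = 262144 fp32 elements" and "32x32 tiles" fields in the logs, so the posted runs really did use a 512x512 matrix. A 1024x1024 matrix with the same tile size would instead report 1048576 elements and 64 tiles per side.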
