CUDA 4.0: -arch sm_20 39% slower than -arch sm_13 on Tesla C2050 (ECC on); __device__ function PTX duplicated?

A compute-intensive kernel built with nvcc is about 40% slower
when compiled with -arch sm_20 than when compiled with -arch sm_13.

The kernel calls __device__ inline void runprog(…)

Comparing the .ptx files produced with and without -arch sm_20
shows the one built with -arch sm_20 is 39% bigger and appears to
contain the PTX assembly code for runprog twice.

I thought the driver might be doing some JIT/re-compilation etc.
the first time the kernel was loaded, but I have run the kernel ten times
and each time it takes approximately 428.25 ms.

I have tried -gencode=arch=compute_20,code=sm_20
but this makes no difference.

I tried searching but failed to find an explanation.

Bill

ps: CUDA compiler version 4.0

/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2011 NVIDIA Corporation
Built on Thu_May_12_11:09:45_PDT_2011
Cuda compilation tools, release 4.0, V0.2.1221

Command lines tried:
setenv sdk_inc /usr/local/cuda/SDK/C/common/inc/

/usr/local/cuda/bin/nvcc -o gpu_gp_kernel.o -c gpu_gp_kernel.cu \
-I. -I$sdk_inc \
-DUNIX -keep \
-gencode=arch=compute_20,code=sm_20

JIT compilation is indeed taking place, but for the sm_13-compiled version, since the Tesla C2050 can only execute compute capability 2.0 binaries. So in order to look at the real code difference, you would first need to produce a binary compiled with arch=compute_13,code=sm_20. The sm_20-compiled code should again be about 39% slower than that binary. Then you could run both binaries through cuobjdump -sass and look at the differences (the code will probably look quite different, so you might first need to figure out what they have in common).
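
For concreteness, the comparison could look something like this (reusing the file names and include path from the command lines above; treat it as a sketch rather than a verified recipe):

# sm_20 machine code generated from compute_13 PTX (roughly what the JIT would produce)
/usr/local/cuda/bin/nvcc -o gpu_gp_kernel_c13.o -c gpu_gp_kernel.cu -I. -I$sdk_inc -DUNIX -gencode=arch=compute_13,code=sm_20

# sm_20 machine code generated natively from compute_20 PTX
/usr/local/cuda/bin/nvcc -o gpu_gp_kernel_c20.o -c gpu_gp_kernel.cu -I. -I$sdk_inc -DUNIX -gencode=arch=compute_20,code=sm_20

# disassemble the embedded machine code of each object and compare
cuobjdump -sass gpu_gp_kernel_c13.o > sass_c13.txt
cuobjdump -sass gpu_gp_kernel_c20.o > sass_c20.txt
diff sass_c13.txt sass_c20.txt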

However, before going through all this, I’d first try to see how CUDA 4.2 does on the code. There might have been improvements in compute capability 2.x code generation since CUDA 4.0 that bring it up to the same speed.

Dear Tera,
Thank you for your prompt reply. Unfortunately I am still confused.
Why, if the driver is spending time converting the sm_13 code, does the
sm_13-compiled kernel run faster?

Also, does the JIT compilation of the sm_13 code (for the Tesla C2050) take place each
time the kernel is launched? I thought it would only happen on the first launch.

Thanks again

Bill

It’s not clear to me exactly what is being compared and how, so I will limit myself to general remarks. Since CUDA 4.0, code for platforms > sm_1x is being processed by a new NVVM compiler based on LLVM, while the old Open64 compiler continues to be used when compiling for sm_1x. Although a lot of work went into the NVVM compiler to provide, at minimum, performance parity with Open64, some performance regressions did occur in the course of the switch, the vast majority of which have already been addressed in subsequent releases. Thus I concur with tera’s recommendation to try the CUDA 4.2 toolchain, which is the latest release (feel free to also try the CUDA 5.0 preview available to registered developers if so inclined).

If your source code contains uses of the “volatile” qualifier that are not strictly needed for functional correctness (i.e., uses of volatile put in place purely to take advantage of certain artifacts of the Open64 compiler that led to a reduction in register pressure), please remove all such instances.
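
For example, a purely hypothetical pattern of this kind (the function and variable names are made up, not taken from your code):

__device__ inline float sum_array(const float *data, int n)
{
    // With Open64, "volatile" on an accumulator like this was sometimes added
    // purely to reduce register pressure; it is not needed for correctness,
    // and with the sm_2x toolchain it only inhibits optimization, so drop it:
    /* volatile */ float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += data[i];
    return sum;
}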

Please note that, independent of the compiler, certain optimizations that were possible on sm_1x are no longer possible on sm_2x and beyond. One common case is the handling of pointers on 64-bit platforms. CUDA requires type compatibility between host and device types, thus on a 64-bit host all pointers are 64-bit. However, when the compiler targets sm_1x it is assured that no sm_1x device has more than 4 GB of device memory, therefore 64-bit pointer computations can be aggressively optimized into 32-bit operations. sm_2x and sm_3x devices, on the other hand, do come with > 4 GB of memory, so most of those “pointer squashing” optimizations are disabled (some are still possible, for example if a pointer is exclusively used to address shared memory and a memory-space-specific pointer can be used).
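
As a minimal, hypothetical illustration of the shared-memory exception (again, not related to the original code): the pointer below only ever addresses shared memory, so the compiler can treat it as a memory-space-specific pointer and keep the address arithmetic narrow even in a 64-bit build.

__global__ void reverse_block(float *out, const float *in)
{
    __shared__ float tile[256];   // launch with at most 256 threads per block

    // This pointer exclusively addresses shared memory, so on sm_2x the
    // compiler can still use a 32-bit, shared-memory-specific pointer here.
    float *p = &tile[threadIdx.x];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    *p = in[idx];
    __syncthreads();
    out[idx] = tile[blockDim.x - 1 - threadIdx.x];
}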

If, after updating the toolchain and possibly cleaning up instances of volatile, you still observe a significant performance decrease that can be clearly attributed to the compiler, I would suggest filing a bug against the compiler, attaching a self-contained repro case.