CUDA 4.0: -arch sm_20 39% slower than -arch sm_13 on Tesla C2050 (ECC on); __device__ function PTX duplicated?

A compute-intensive kernel built with nvcc is about 40% slower
when compiled with -arch sm_20 than when compiled with -arch sm_13.

The kernel calls __device__ inline void runprog(…)

Comparing the .ptx files produced with and without -arch sm_20
shows the one built with -arch sm_20 is 39% bigger and appears to
contain the PTX assembly code for runprog twice.

I thought the driver might be doing some JIT/re-compilation etc.
the first time the kernel was loaded, but I have run the kernel ten times
and each time it takes approximately 428.25 ms.

I have tried -gencode=arch=compute_20,code=sm_20
but this makes no difference.

I tried searching but failed to find an explanation.

Bill

ps: CUDA compiler version 4.0

/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2011 NVIDIA Corporation
Built on Thu_May_12_11:09:45_PDT_2011
Cuda compilation tools, release 4.0, V0.2.1221

Command lines tried:
setenv sdk_inc /usr/local/cuda/SDK/C/common/inc/

/usr/local/cuda/bin/nvcc -o gpu_gp_kernel.o -c gpu_gp_kernel.cu \
-I. -I$sdk_inc \
-DUNIX -keep \
-gencode=arch=compute_20,code=sm_20

JIT compilation is indeed taking place, but for the sm_13-compiled version, since the Tesla C2050 can only execute compute capability 2.0 binaries. So in order to look at the real code difference, you would first need to produce a binary compiled with arch=compute_13,code=sm_20. The sm_20-compiled code should again be about 39% slower than that binary. Then you could run both binaries through cuobjdump -sass and look at the differences (the code will probably look quite different, so you might first need to figure out what they have in common).
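
For concreteness, the comparison could look something like this (reusing the file names and include path from the command lines above; treat it as a sketch rather than a verified recipe):

# sm_20 machine code generated from compute_13 PTX (roughly what the JIT would produce)
/usr/local/cuda/bin/nvcc -o gpu_gp_kernel_c13.o -c gpu_gp_kernel.cu -I. -I$sdk_inc -DUNIX -gencode=arch=compute_13,code=sm_20

# sm_20 machine code generated natively from compute_20 PTX
/usr/local/cuda/bin/nvcc -o gpu_gp_kernel_c20.o -c gpu_gp_kernel.cu -I. -I$sdk_inc -DUNIX -gencode=arch=compute_20,code=sm_20

# disassemble the embedded machine code of each object and compare
cuobjdump -sass gpu_gp_kernel_c13.o > sass_c13.txt
cuobjdump -sass gpu_gp_kernel_c20.o > sass_c20.txt
diff sass_c13.txt sass_c20.txt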

However, before going through all this, I’d first try to see how CUDA 4.2 does on the code. There might have been improvements in compute capability 2.x code generation since CUDA 4.0 that bring it up to the same speed.

Dear Tera,
Thank you for your prompt reply. Unfortunately I am still confused.
Why, if the driver is spending time converting the sm_13 code, does the
sm_13-compiled kernel run faster?

Also, does the JIT compilation of the sm_13 code (for the Tesla C2050) take place each
time the kernel is launched? I thought it would only happen on the first launch.

Thanks again

Bill

It’s not clear to me exactly what is being compared and how, so I will limit myself to general remarks. Since CUDA 4.0, code for platforms > sm_1x is being processed by a new NVVM compiler based on LLVM, while the old Open64 compiler continues to be used when compiling for sm_1x. Although a lot of work went into the NVVM compiler to provide, at minimum, performance parity with Open64, some performance regressions did occur in the course of the switch, the vast majority of which have already been addressed in subsequent releases. Thus I concur with tera’s recommendation to try the CUDA 4.2 toolchain, which is the latest release (feel free to also try the CUDA 5.0 preview available to registered developers if so inclined).

If your source code contains uses of the “volatile” qualifier that are not strictly needed for functional correctness (i.e., uses of volatile put in place purely to take advantage of certain artifacts of the Open64 compiler that led to a reduction in register pressure), please remove all such instances.
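
For example, a purely hypothetical pattern of this kind (the function and variable names are made up, not taken from your code):

__device__ inline float sum_array(const float *data, int n)
{
    // With Open64, "volatile" on an accumulator like this was sometimes added
    // purely to reduce register pressure; it is not needed for correctness,
    // and with the sm_2x toolchain it only inhibits optimization, so drop it:
    /* volatile */ float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += data[i];
    return sum;
}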

Please note that, independent of the compiler, certain optimizations that were possible on sm_1x are no longer possible on sm_2x and beyond. One common case is the handling of pointers on 64-bit platforms. CUDA requires type compatibility between host and device types, thus on a 64-bit host all pointers are 64-bit. However, when the compiler targets sm_1x it is assured that no sm_1x device has more than 4 GB of device memory, therefore 64-bit pointer computations can be aggressively optimized into 32-bit operations. sm_2x and sm_3x devices, on the other hand, do come with > 4 GB of memory, so most of those “pointer squashing” optimizations are disabled (some are still possible, for example if a pointer is exclusively used to address shared memory and a memory-space-specific pointer can be used).
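
As a minimal, hypothetical illustration of the shared-memory exception (again, not related to the original code): the pointer below only ever addresses shared memory, so the compiler can treat it as a memory-space-specific pointer and keep the address arithmetic narrow even in a 64-bit build.

__global__ void reverse_block(float *out, const float *in)
{
    __shared__ float tile[256];   // launch with at most 256 threads per block

    // This pointer exclusively addresses shared memory, so on sm_2x the
    // compiler can still use a 32-bit, shared-memory-specific pointer here.
    float *p = &tile[threadIdx.x];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    *p = in[idx];
    __syncthreads();
    out[idx] = tile[blockDim.x - 1 - threadIdx.x];
}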

If, after updating the toolchain and possibly cleaning up instances of volatile, you still observe a significant performance decrease that can be clearly attributed to the compiler, I would suggest filing a bug against the compiler, attaching a self-contained repro case.