CUDA v4.1 substantially slower than v4.0

Tried different cards and configurations (Quadro 2000, GTX 580, GTX 580x2, GTX 580x4, GTX 480x4).
Driver 286.16

Same project, source code didn’t change, preferences didn’t change.

The project I built with the v4.0 toolkit (April 2011) runs more than 2.5x faster than the same project built with the latest v4.1 toolkit.

Is there any reasonable explanation for this? What am I missing? Some new compiler switch or something like that, maybe…

Yes. Starting with CUDA 4.1, nvcc uses an LLVM-based compiler for Fermi and later cards.

I knew it. But NVIDIA promised 10% faster code, not code that is twice as slow.

What can I do about it, besides using an old toolkit?

If you are using “volatile”, remove it.

File a bug with a small repro so the compiler team can understand the cause of the regression.

Is it possible for us to get a choice of which compiler to use?

I have the same problem. With Runtime 4.1, speed is 3x slower than with the previous version of the Runtime. GPU M540, driver 286.16. What can I do about it, besides using an old toolkit?

Have you checked whether the 4.1 kernel reaches the same occupancy? Chances are 4.1 uses more registers per kernel. A __launch_bounds__ directive to tell the compiler about the intended execution configuration would fix this. Check appendix B.18 of the Programming Guide.
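For example (a minimal sketch; the kernel and the numbers are made up), telling the compiler that the kernel is launched with at most 256 threads per block and that at least 4 blocks should be resident per multiprocessor:

// Sketch only: the bounds must match the actual launch configuration.
// Signature: __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
__global__ void
__launch_bounds__(256, 4)
my_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}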

Yes, I've tried different combinations of launch_bounds. No effect.

As mfatica mentioned, please remove any and all uses of volatile that are not strictly needed for correctness (in particular any uses of volatile intended to manipulate register pressure with older versions of CUDA).
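For illustration, here is the kind of construct meant (a sketch with made-up names): volatile was sometimes used to throttle register pressure under the old compiler, but with 4.1 it just forces redundant memory traffic.

__global__ void scale_sq(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Old trick: "volatile float t = in[i];" to influence register
    // allocation under the pre-4.1 compiler. With 4.1, drop the
    // qualifier; plain code lets the LLVM-based compiler optimize.
    float t = in[i];
    out[i] = t * t;
}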

Are you able to narrow down why the app is running more slowly? In particular it would be useful to determine whether it is a code generation issue (e.g. much increased register pressure causing spilling) or a driver-related issue. Does the app behave functionally correctly? I assume you have verified that the installation of CUDA 4.1 and the matching driver was successful?
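One way to check (my suggestion; the kernel below is just a placeholder) is to build with both toolkits and compare the ptxas resource report, which lists registers used and spill loads/stores per kernel:

// Build with: nvcc -arch=sm_20 -Xptxas -v regcheck.cu
// ptxas then prints a per-kernel summary along the lines of:
//   ptxas info : Used NN registers, MM bytes spill stores, ...
__global__ void regcheck(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];
}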

If everything points to CUDA 4.1 as the cause for the app’s slowdown, I would suggest filing a bug (there is a link on the registered developer website for that). Please attach a self-contained repro case that demonstrates the problem. Thanks.

We ran into a similar problem in our QCD code. Inspection of the assembler output shows that CUDA 4.0 produces LD.E.128 instructions for loading double-precision complex numbers, while CUDA 4.1 produces twice as many LD.E.64 instructions. Similar behaviour and a workaround were posted in a recent thread.
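A sketch of one possible workaround (my own reconstruction, not necessarily the one from that thread): read the complex number through a double2 so the compiler is forced to emit a single 128-bit load. This requires the data to be 16-byte aligned.

struct cplx { double re, im; };

__global__ void scale(const cplx *in, cplx *out, double s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // double2 is 16-byte aligned, so this maps to one LD.E.128
    // instead of two LD.E.64 loads.
    double2 v = reinterpret_cast<const double2 *>(in)[i];
    v.x *= s;
    v.y *= s;
    reinterpret_cast<double2 *>(out)[i] = v;
}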

There have been reports of vectorization issues (i.e. the lack of vectorization) with CUDA 4.1. To ensure that all instances of regression can be looked into by the compiler team, I would encourage you to file a bug, attaching self-contained repro code.

There are indications that vectorization sometimes does not occur because the compiler diagnoses possible aliasing. Based on that, adding the __restrict__ qualifier to all pointer arguments of a function may improve the situation. This requires that all objects pointed to are indeed not aliased, in particular not identical or overlapping. Please refer to section B.2.4 of the CUDA C Programming Guide.
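A minimal sketch of what that looks like (kernel and names are illustrative); this is only legal if a, b and out never overlap:

__global__ void add(const double * __restrict__ a,
                    const double * __restrict__ b,
                    double * __restrict__ out,
                    int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a[i] + b[i];
}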