CUDA v4.1 substantially slower than v4.0

Tried different cards and configurations (Quadro 2000, GTX 580, GTX 580x2, GTX 580x4, GTX 480x4).
Driver 286.16

Same project, source code didn’t change, preferences didn’t change.

The project I built with the v4.0 toolkit (April 2011) runs more than 2.5x faster than the same project built with the latest v4.1 toolkit.

Is there any reasonable explanation for this? What am I missing? Some new compiler switch or something like that, maybe…

Yes. Starting with CUDA 4.1, nvcc uses an LLVM-based compiler for Fermi and later cards.

I knew it. But NVIDIA promised 10% faster code, not code that is twice as slow.

What can I do about it, besides using an old toolkit?

If you are using “volatile”, remove it.

File a bug with a small repro so the compiler team can understand the cause of the regression.

Is it possible for us to get a choice of which compiler to use?

I have the same problem. With Runtime 4.1, speed is 3x slower than with the previous version of the Runtime. GPU M540, driver 286.16. What can I do about it, besides using an old toolkit?

Have you checked whether the 4.1 kernel reaches the same occupancy? Chances are 4.1 uses more registers per kernel. A __launch_bounds__ directive to tell the compiler about the intended execution configuration would fix this. Check appendix B.18 of the Programming Guide.
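For example (a minimal sketch; the kernel and the numbers are made up), telling the compiler that the kernel is launched with at most 256 threads per block and that at least 4 blocks should be resident per multiprocessor:

// Sketch only: the bounds must match the actual launch configuration.
// Signature: __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
__global__ void
__launch_bounds__(256, 4)
my_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}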

Yes, I've tried different combinations of launch_bounds. No effect.

As mfatica mentioned, please remove any and all uses of volatile that are not strictly needed for correctness (in particular any uses of volatile intended to manipulate register pressure with older versions of CUDA).
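For illustration, here is the kind of construct meant (a sketch with made-up names): volatile was sometimes used to throttle register pressure under the old compiler, but with 4.1 it just forces redundant memory traffic.

__global__ void scale_sq(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Old trick: "volatile float t = in[i];" to influence register
    // allocation under the pre-4.1 compiler. With 4.1, drop the
    // qualifier; plain code lets the LLVM-based compiler optimize.
    float t = in[i];
    out[i] = t * t;
}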

Are you able to narrow down why the app is running more slowly? In particular it would be useful to determine whether it is a code generation issue (e.g. much increased register pressure causing spilling) or a driver-related issue. Does the app behave functionally correctly? I assume you have verified that the installation of CUDA 4.1 and the matching driver was successful?
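One way to check (my suggestion; the kernel below is just a placeholder) is to build with both toolkits and compare the ptxas resource report, which lists registers used and spill loads/stores per kernel:

// Build with: nvcc -arch=sm_20 -Xptxas -v regcheck.cu
// ptxas then prints a per-kernel summary along the lines of:
//   ptxas info : Used NN registers, MM bytes spill stores, ...
__global__ void regcheck(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];
}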

If everything points to CUDA 4.1 as the cause for the app’s slowdown, I would suggest filing a bug (there is a link on the registered developer website for that). Please attach a self-contained repro case that demonstrates the problem. Thanks.

We ran into a similar problem in our QCD code. Inspection of the assembler output shows that CUDA 4.0 produces LD.E.128 instructions for loading double-precision complex numbers, while CUDA 4.1 produces twice as many LD.E.64 instructions. Similar behaviour and a workaround were posted in a recent thread.
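A sketch of one possible workaround (my own reconstruction, not necessarily the one from that thread): read the complex number through a double2 so the compiler is forced to emit a single 128-bit load. This requires the data to be 16-byte aligned.

struct cplx { double re, im; };

__global__ void scale(const cplx *in, cplx *out, double s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // double2 is 16-byte aligned, so this maps to one LD.E.128
    // instead of two LD.E.64 loads.
    double2 v = reinterpret_cast<const double2 *>(in)[i];
    v.x *= s;
    v.y *= s;
    reinterpret_cast<double2 *>(out)[i] = v;
}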

There have been reports of vectorization issues (i.e. the lack of vectorization) with CUDA 4.1. To ensure that all instances of regression can be looked into by the compiler team, I would encourage you to file a bug, attaching self-contained repro code.

There are indications that vectorization sometimes does not occur because the compiler diagnoses possible aliasing. Based on that, adding the __restrict__ qualifier to all pointer arguments of a function may improve the situation. This requires that all objects pointed to are indeed not aliased, in particular not identical or overlapping. Please refer to section B.2.4 of the CUDA C Programming Guide.
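A minimal sketch of what that looks like (kernel and names are illustrative); this is only legal if a, b and out never overlap:

__global__ void add(const double * __restrict__ a,
                    const double * __restrict__ b,
                    double * __restrict__ out,
                    int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a[i] + b[i];
}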