I have the same problem. With Runtime 4.1 my code runs 3x slower than with the previous version of the Runtime. GPU M540, driver 286.16. What can I do about it, besides using an old toolkit?
Have you checked that the 4.1 kernel reaches the same occupancy? Chances are 4.1 uses more registers for this kernel. A __launch_bounds__ directive, which tells the compiler about the intended execution configuration, may fix this. Check appendix B.18 of the Programming Guide.
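To illustrate the suggestion above, here is a minimal sketch of a kernel with a __launch_bounds__ qualifier, plus a host-side query of the register count the compiler settled on. The kernel name scale, the block size of 256 and the target of 2 resident blocks per multiprocessor are assumptions chosen for illustration; substitute the values that match your own launch configuration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 256   // assumed maximum block size used at launch
#define MIN_BLOCKS_PER_SM 2     // assumed desired resident blocks per multiprocessor

// Hypothetical kernel. The qualifier promises the compiler that the kernel is
// never launched with more than THREADS_PER_BLOCK threads per block and asks
// it to limit register use so that MIN_BLOCKS_PER_SM blocks can be resident.
__global__ void __launch_bounds__(THREADS_PER_BLOCK, MIN_BLOCKS_PER_SM)
scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

int main()
{
    // Query what the compiler generated for this kernel.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, scale);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // A non-zero local memory size can be a hint that registers are spilled.
    printf("registers per thread: %d\n", attr.numRegs);
    printf("local memory per thread: %lu bytes\n",
           (unsigned long)attr.localSizeBytes);
    return 0;
}
```

Building the same file with both toolkits and comparing the printed numbers (or the output of nvcc -Xptxas -v) should show whether register use actually changed between 4.0 and 4.1.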
As mfatica mentioned, please remove any and all uses of volatile that are not strictly needed for correctness (in particular any uses of volatile intended to manipulate register pressure with older versions of CUDA).
Are you able to narrow down why the app is running more slowly? In particular it would be useful to determine whether it is a code generation issue (e.g. much increased register pressure causing spilling) or a driver-related issue. Is the app still functionally correct? I assume you have verified that the installation of CUDA 4.1 and the matching driver was successful?
If everything points to CUDA 4.1 as the cause for the app’s slowdown, I would suggest filing a bug (there is a link on the registered developer website for that). Please attach a self-contained repro case that demonstrates the problem. Thanks.
We ran into a similar problem in our QCD codes. Inspection of the assembler output shows that CUDA 4.0 produces LD.E.128 instructions for loading a double precision complex number, while CUDA 4.1 produces twice as many LD.E.64 instructions. Similar behaviour and a workaround were posted in a recent thread.
There have been reports of vectorization issues (i.e. the lack of vectorization) with CUDA 4.1. To ensure that all instances of regression can be looked into by the compiler team, I would encourage you to file a bug, attaching self-contained repro code.
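As a rough idea of what a self-contained repro for the load-width issue could look like, here is a sketch. It is not taken from any of the codes mentioned in this thread: the kernel name conj_copy, the array size and the block size are all hypothetical. Each thread loads one double2 (16 bytes), so the load of in[i] is the instruction to look for in the disassembly.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical repro kernel: each thread loads one double precision complex
// number stored as double2 and writes back its complex conjugate.
__global__ void conj_copy(const double2 *in, double2 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double2 z = in[i];   // the load whose width is of interest
        z.y = -z.y;
        out[i] = z;
    }
}

int main()
{
    const int n = 1 << 20;                 // assumed problem size
    const size_t bytes = n * sizeof(double2);
    double2 *in = 0;
    double2 *out = 0;

    if (cudaMalloc((void **)&in, bytes) != cudaSuccess ||
        cudaMalloc((void **)&out, bytes) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(in, 0, bytes);

    conj_copy<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaError_t err = cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(err));

    cudaFree(in);
    cudaFree(out);
    return err == cudaSuccess ? 0 : 1;
}
```

Compiling this for a Fermi target with each toolkit and dumping the machine code with cuobjdump -sass lets you compare the load instructions side by side. Whether this particular kernel actually shows the regression is not something I can promise; it may need to be made closer to the real code.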
There are indications that vectorization sometimes does not occur because the compiler diagnoses possible aliasing. Based on that, adding the __restrict__ qualifier to all pointer arguments of a function may improve the situation. This requires that the objects pointed to are indeed not aliased, in particular that they are neither identical nor overlapping. Please refer to section B.2.4 of the CUDA C Programming Guide.
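A minimal sketch of the __restrict__ suggestion, assuming a kernel that adds two arrays of double precision complex numbers. The kernel name add_complex and the sizes are hypothetical, and the three arrays are separate allocations so the no-aliasing requirement holds.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel. __restrict__ on every pointer argument asserts that
// a, b and c never refer to the same or overlapping memory.
__global__ void add_complex(const double2 * __restrict__ a,
                            const double2 * __restrict__ b,
                            double2 * __restrict__ c,
                            int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double2 x = a[i];
        double2 y = b[i];
        c[i] = make_double2(x.x + y.x, x.y + y.y);
    }
}

int main()
{
    const int n = 1024;                    // assumed problem size
    const size_t bytes = n * sizeof(double2);
    double2 *a = 0, *b = 0, *c = 0;

    // Three distinct allocations, so the pointers cannot alias.
    if (cudaMalloc((void **)&a, bytes) != cudaSuccess ||
        cudaMalloc((void **)&b, bytes) != cudaSuccess ||
        cudaMalloc((void **)&c, bytes) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(a, 0, bytes);
    cudaMemset(b, 0, bytes);

    add_complex<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaError_t err = cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(err));

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return err == cudaSuccess ? 0 : 1;
}
```

Whether the qualifier brings back the wider loads in a given kernel has to be checked in the generated machine code. Note that calling such a kernel with overlapping buffers (for example an in-place update with c equal to a) breaks the promise made to the compiler.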