I’ve recently compiled my CUDA code with the nvcc from the 2.0 beta and noticed a slow-down in performance (from 245 GFLOP/s to 190 GFLOP/s).
The kernel is relatively complex (lots of book-keeping), so it uses quite a few registers (58). However, I compile with the -maxrregcount=32 option in order to better utilise the device, since some of the data can be stored in lmem and retrieved from time to time without affecting performance. This holds with CUDA v1.1 (but not with CUDA v1.0), where I reach the performance I expect, which led me to conclude that CUDA v1.1 is good at optimising register usage.
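To illustrate why the register cap matters, here is a rough sketch of the register-limited thread count per multiprocessor. The post does not name the GPU, so the limits below (8192 registers and 768 resident threads per multiprocessor, warp size 32) are an assumption that holds for compute capability 1.0/1.1 devices. The function name and constants are made up for this example, and the estimate ignores block size, shared memory and register allocation granularity, so treat the result as an upper bound rather than the real occupancy.

```cpp
// Simplified, host-only estimate of how many threads can be resident on one
// multiprocessor when registers are the limiting resource.
// ASSUMPTION: compute capability 1.0/1.1 limits; the actual GPU is not stated.
constexpr int kRegistersPerSM  = 8192; // 32-bit registers per multiprocessor
constexpr int kMaxThreadsPerSM = 768;  // hardware cap on resident threads
constexpr int kWarpSize        = 32;   // threads are scheduled in whole warps

constexpr int min_int(int a, int b) { return a < b ? a : b; }

// Hypothetical helper: upper bound on resident threads for a given
// per-thread register count, rounded down to whole warps.
constexpr int max_resident_threads(int regs_per_thread) {
    return min_int(kMaxThreadsPerSM,
                   (kRegistersPerSM / regs_per_thread) / kWarpSize * kWarpSize);
}
```

Under these assumptions, `max_resident_threads(58)` gives 128 threads (4 warps) and `max_resident_threads(32)` gives 256 threads (8 warps), so the cap can roughly double the number of warps available to hide latency, at the price of spilling to lmem.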
I decided to try CUDA v2.0 beta, and to my surprise the code was noticeably slower (190 GFLOP/s) than with CUDA v1.1 (245 GFLOP/s).
At the moment I am a bit at a loss. Has anybody else experienced this behaviour, and does anyone have any advice?
If it is of any help, the relevant parts of my kernel can be found at this link. The full code is unfortunately not public yet, but it will be soon.