CUDA v2.0 beta is slower than CUDA v1.1. Is this just temporary?

Dear All,

I’ve recently tried to compile my CUDA code with nvcc 2.0beta and noticed a slow-down in performance (from 245 GFLOP/s to 190 GFLOP/s).

The kernel is relatively complex (lots of book-keeping) and therefore uses quite a few registers (58). However, I compile with the -maxrregcount=32 option in order to better utilise the device, since some of the data can be spilled to local memory (lmem) and retrieved from time to time without hurting performance. This is indeed the case with CUDA v1.1 (but not with CUDA v1.0), where I reach the performance I expect, which led me to conclude that CUDA v1.1 is good at optimising register usage.
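As a minimal sketch of the setup described above (the file name kernel.cu is a placeholder), the register cap can be combined with ptxas's verbose flag to see exactly how many registers and how much lmem the compiler actually uses:

```shell
# Hypothetical invocation: cap registers at 32 and ask ptxas to report
# per-kernel register and local-memory (lmem) usage.
nvcc -O3 -maxrregcount=32 --ptxas-options=-v -c kernel.cu -o kernel.o
# ptxas then prints a summary along the lines of:
#   ptxas info : Used 32 registers, <N> bytes lmem
```

Comparing that summary between the two toolkit versions shows whether the 2.0 beta is spilling more (or different) data to lmem.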

I decided to try CUDA v2.0 beta, and to my surprise the code was noticeably slower (190 GFLOP/s) compared to CUDA v1.1 (245 GFLOP/s).

At this moment I am a bit at a loss, and I wonder whether anybody else has experienced this behaviour and may have some advice.

If it is of any help, the relevant parts of my kernel can be found at this link. The full code is unfortunately not yet public, but it will be soon.


If I were you, I would PM the full code to someone from NVIDIA (netllama, e.g.) if you can. Then they can perhaps find out what causes the difference.

Alternatively, you can compile with -keep and check for differences in the .ptx files. It might be that the compiler chooses to put different variables in local memory.
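A sketch of that comparison (the toolkit paths and kernel.cu are assumptions; adjust them to your installs): run -keep from two separate directories so each toolkit's intermediate files survive, then look for local-memory traffic in the PTX.

```shell
# Keep intermediate files from each toolkit in its own directory.
mkdir ptx11 ptx20
cd ptx11 && /usr/local/cuda-1.1/bin/nvcc -maxrregcount=32 -keep -c ../kernel.cu && cd ..
cd ptx20 && /usr/local/cuda-2.0/bin/nvcc -maxrregcount=32 -keep -c ../kernel.cu && cd ..
# Spilled variables show up as .local declarations and ld.local/st.local ops:
grep -c '\.local' ptx11/kernel.ptx ptx20/kernel.ptx
# Then inspect where the two compilers diverge:
diff ptx11/kernel.ptx ptx20/kernel.ptx | less
```

If the 2.0 beta spills different (or more frequently accessed) variables than 1.1 does, that would explain the drop from 245 to 190 GFLOP/s.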

This sounds like a known issue which will be resolved in the final CUDA_2.0 release (which is expected in the near future).

If you are interested, I could supply the complete source code, under a temporary request not to circulate it. If so, please let me know how I can send it to you.