I am developing a rather large codebase with several CUDA kernels. At some point I implemented yet another kernel but tried running the program without ever invoking it. To my surprise, the program ran around 10% slower on my GTX 260 machine. If I wrap that unused kernel in `#if 0`, the strange time overhead goes away.
I thought there might be something wrong with my card, OS, or drivers, so I moved to a completely different computer. I encountered the very same problem on a GTX 280.
So I decided to stick with the kernels I already have, but added one more device variable. To my even bigger surprise, the program again ran about 10% slower, and I haven't even begun to use that variable!
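To make the situation concrete, here is roughly what the setup looks like (the kernel and variable names are made up, just to illustrate; the real code is much larger):

```cuda
// Minimal sketch (names hypothetical): the kernel below is compiled into the
// binary but never launched, and the __device__ variable is never read or
// written by any code that actually runs.

#if 1   // switching this to 0 makes the ~10% slowdown disappear
__device__ float g_unusedTable[4096];   // the extra device variable

__global__ void unusedKernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = g_unusedTable[i % 4096];
}
#endif

int main()
{
    // ... the rest of the program launches its existing kernels as before;
    // unusedKernel is never invoked anywhere.
    return 0;
}
```

Merely toggling that `#if` is enough to reproduce the timing difference on both machines.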
What is going on?
Is there some upper limit on the number of global functions and device/constant variables?
Or could this be a driver bug?