Upper limit on kernel/global variable count?

I am developing a rather large codebase with several CUDA kernels. At some point I implemented yet another kernel, but tried running the program without even invoking it. To my surprise, the program ran around 10% slower on my GTX 260 machine. If I wrap that unused kernel in #if 0, the strange time overhead goes away.
I thought something might be wrong with my card, OS, or drivers, so I moved to a completely different computer. I encountered the very same problem on a GTX 280.

Therefore I decided to stick with the kernels I already have, but added one more device variable. To my even bigger surprise, the program again ran about 10% slower! I haven't even begun to use that variable!
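To make the situation concrete, here is a minimal sketch of what I mean (the kernel and variable names are hypothetical, not my actual code). Only activeKernel is ever launched; the block guarded by #if 0 is what seems to cause the overhead when it is compiled in:

```cuda
#include <cuda_runtime.h>

// Only this kernel is ever launched from host code.
__global__ void activeKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

#if 0   // disabling this block removes the ~10% overhead
__global__ void unusedKernel(float *data)   // never invoked
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1.0f;
}

__device__ float unusedVar;   // never read or written anywhere
#endif

int main()
{
    float *d_data;
    cudaMalloc(&d_data, 256 * sizeof(float));
    activeKernel<<<1, 256>>>(d_data);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```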

What is going on?
Is there some upper limit on the number of __global__ functions and __device__/__constant__ variables?
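As far as I can tell, there is no documented limit on the number of __global__ functions, although __constant__ memory itself is limited (64 KB on GT200-class cards). A small sketch to query the relevant device limits, assuming device 0 is the GTX 260/280:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// There is no device property for "number of kernels", but the
// constant and global memory sizes can be checked directly.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Device: %s\n", prop.name);
    printf("Total constant memory: %zu bytes\n", prop.totalConstMem);
    printf("Total global memory:   %zu bytes\n", prop.totalGlobalMem);
    return 0;
}
```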
Or maybe it is some driver bug?

Update: I ran the Visual Profiler to compare the two programs.
In the second case I can clearly see huge white gaps between kernel invocations (on the GPU Time Width Plot). Something strange is going on that is not caused by my own code. Any hints as to what it may be?

To visualise what I am talking about, I am attaching a small screenshot from the Visual Profiler.
There is no heavy CPU computation between any of the kernel calls. Besides, if it were something in my code, it would repeat itself in every iteration of my loop, whereas, as you can see, it occurs only sparsely.

Hey, any chance of getting a response to my problem?
Or is something in my description unclear?

Hello PDan,

I have noticed the same thing. Have you found out anything more about this issue?

One possibility could be that since the Visual Profiler only targets one multiprocessor (MP), that MP may have been idle while the other ones finished up.