Upper limit on kernel/global variable count?

I am developing a rather large codebase with several CUDA kernels. At some point I implemented yet another kernel, but tried running the code without even invoking it. To my surprise, the program ran around 10% slower on my GTX260 machine. If I wrap that unused kernel in #if 0 / #endif, the strange time overhead goes away.
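For reference, the guard looks roughly like this (the kernel name and body here are just placeholders, not my actual code):

#if 0   // compiling this out makes the ~10% overhead disappear
__global__ void unusedKernel(float *data)   // declared but never launched anywhere
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}
#endif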
I thought there might be something wrong with my card, OS, or drivers, so I moved to a completely different computer. I encountered the very same problem on a GTX280.

Therefore I decided to stick with the kernels I already have, but added one more device variable. To my even bigger surprise, the program again ran about 10% slower! I haven't even begun to use that variable!
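To be clear, I mean nothing more exotic than a declaration like this (the name is made up):

__device__ float g_extraVar;   // never read or written anywhere - merely declaring it slows the program down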

What is going on?
Is there some upper limit on the number of __global__ functions and __device__/__constant__ variables?
Or maybe it is some driver bug?

Update: I ran the Visual Profiler to compare the two versions of the program.
In the slower one I can clearly see huge white gaps between kernel invocations (on the GPU Time Width Plot). Something strange is going on there, and it is not caused by my own code. Any hints as to what it may be?

To visualise what I am talking about, I am attaching a small screenshot from the Visual Profiler.
There is no heavy CPU computation between any of the kernel calls. Besides, if it were something in my code, it would repeat itself in every iteration of my loop, while - as you can see - it occurs only sporadically.
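In case it helps, here is a sketch of how the gaps could be measured without the profiler, by timing the GPU idle time between consecutive launches with CUDA events (myKernel, the launch configuration, and the 1 ms threshold are all placeholders, not my real code):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data)            // stand-in for a real kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1.0f;
}

int main()
{
    const int n = 256 * 256;
    const int numIterations = 100;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop, prevStop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventCreate(&prevStop);
    cudaEventRecord(prevStop, 0);                // baseline before the first launch

    for (int it = 0; it < numIterations; ++it) {
        cudaEventRecord(start, 0);               // timestamped just before this kernel runs
        myKernel<<<n / 256, 256>>>(d_data);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float gapMs = 0.0f;
        cudaEventElapsedTime(&gapMs, prevStop, start);   // GPU idle time since the previous kernel
        if (gapMs > 1.0f)                        // arbitrary threshold for a "white gap"
            printf("iteration %d: %.3f ms gap before the kernel\n", it, gapMs);

        cudaEvent_t tmp = prevStop; prevStop = stop; stop = tmp;   // recycle the events
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaEventDestroy(prevStop);
    cudaFree(d_data);
    return 0;
}

If the large per-iteration gaps show up here too, that would confirm the idle time is real GPU/driver-side behaviour rather than an artifact of the profiler.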

Hey, any chance of getting a response to my problem?
Or is something unclear?

Hello PDan,

I have noticed the same thing. Have you found out anything more about this issue?

One possibility could be that since the Visual Profiler only targets one MP, that MP may have been idle while the other ones finished up.

Roger