Diminishing performance?

Hey all,

I have a kernel which I call multiple times (indefinitely, until the program terminates). For testing purposes (to try to diagnose what’s causing this performance problem), I’ve made it so this kernel gets identical inputs, and I’ve checked that it’s giving me identical outputs each time…

On average, this kernel takes ~4ms to execute. However, after an arbitrary number of executions (it’s happened after 2-3, sometimes after 30-1000+, sometimes never), its execution time jumps to 9-12ms (again, identical inputs/outputs, and no errors) and it consistently runs at this (2-3 times) slower speed for the rest of the process’s life span.

To reiterate: the code paths taken in both the kernel and the CPU (Driver API) code are identical, and receive/output identical values each time I call this function - all running in the same CPU thread and on the same CUDA device.

I should note, each time I call this kernel I push a context (which I create at the beginning of the program), allocate new memory, transfer the memory from host->device, synchronize the async memcpys, start the event timer, launch the kernel, synchronize, stop the event timer, clean up the memory, and pop the context. Note: the timings I’m referring to are between the start and stop events.
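In Driver API terms, the per-call sequence I just described looks roughly like this (a minimal sketch for context, not my actual code - names like `ctx`, `fn`, `d_in` are placeholders, and error checking is omitted):

```cpp
// Sketch of the per-call sequence (CUDA Driver API, circa CUDA 2.0).
cuCtxPushCurrent(ctx);                  // context created once at startup

CUdeviceptr d_in, d_out;
cuMemAlloc(&d_in,  bytes);              // fresh allocations on every call
cuMemAlloc(&d_out, bytes);
cuMemcpyHtoDAsync(d_in, h_in, bytes, 0);
cuCtxSynchronize();                     // wait for the async copy

cuEventRecord(start, 0);                // the timings quoted are between
cuLaunchGrid(fn, gridW, gridH);         // these two events
cuCtxSynchronize();
cuEventRecord(stop, 0);
cuEventSynchronize(stop);

float ms;
cuEventElapsedTime(&ms, start, stop);

cuMemFree(d_in);                        // tear everything down again
cuMemFree(d_out);
cuCtxPopCurrent(&ctx);
```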

My question is, what can be causing these unpredictable performance degradations? And more importantly, what can I do to avoid them entirely?


CUDA Version: 2.0
Operating System: Windows Vista (32bit)
Card: Quadro FX 570
Driver Version: 177.84

P.S. I’m also calling other kernels both before and after this kernel - however, I synchronize both before and after it, so I can’t see how those other kernels could affect its performance.

Are you running display on the card?

I am - however there’s nothing but Visual Studio running in this case. (I even tried disabling Aero, same results - not that I’d expect something as basic as Aero to interfere with kernel timings.) Additionally, I don’t see this behavior in any of my other kernels (although none of them are as time consuming as this one).

I’ve also managed to isolate the test case, so I’m now only running one kernel (nothing before/after it) - I still see identical results.

The kernel uses 0 bytes of lmem, 5840 bytes of smem, and 11 registers.

Edit: Would I be right in assuming that if some other application were using the GPU via OpenGL/D3D, it could be consuming smem - thus only allowing 1 block to run at a time (if it used enough smem), as opposed to the 2 blocks possible if the full 16KB of smem were free?

Second Edit: My assumption clearly isn’t correct. I managed to bring the smem usage down to 1508 bytes (allowing more than 2 blocks at a time, but introducing 4-way bank conflicts, which brought the time up to ~5ms), and it exhibits identical behaviour - after a few iterations, it jumps up to ~12ms.

Have you tried timing the various pieces of what you’re doing individually? I don’t know how the CUDA timer works; I’d suggest using Windows’s QueryPerformanceCounter() and throwing in a lot of cudaThreadSynchronize() calls everywhere. It could be some sort of shortcoming in any one of the CUDA calls you make. The kernel itself shouldn’t slow down, and other 3D apps certainly can’t steal smem (although they can slow down overall execution by a lot).
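For what it’s worth, a rough sketch of that idea (Windows-specific; QueryPerformanceCounter is the real Win32 call, but the helper name and the bracketed calls are just illustrative):

```cpp
// Bracket each stage with a synchronize plus a wall-clock read, so the
// cost of every individual call shows up separately. Windows-only (QPC).
#include <windows.h>
#include <stdio.h>

static LARGE_INTEGER freq;  // set once via QueryPerformanceFrequency(&freq)

static double now_ms(void) {
    LARGE_INTEGER t;
    QueryPerformanceCounter(&t);
    return 1000.0 * (double)t.QuadPart / (double)freq.QuadPart;
}

// Usage around each stage (the synchronize forces pending GPU work to
// finish, so each interval measures only the bracketed call):
//
//   QueryPerformanceFrequency(&freq);
//   double t0 = now_ms();
//   cuMemcpyHtoD(d_in, h_in, bytes);  cuCtxSynchronize();
//   printf("HtoD copy: %.3f ms\n", now_ms() - t0);
```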

Btw, have you tried running the Visual Profiler to see if any of the counters change? Maybe the kernel isn’t really doing the same work every iteration.

OpenGL/D3D certainly do use shared memory, but it’s NOT used simultaneously with CUDA. Like a CUDA kernel invocation, graphics computes get ALL 16KB of shared memory to themselves during their compute; shared memory splitting only happens in CUDA between blocks of the same kernel. (At least currently - a later CUDA could conceivably allow mixing multiple kernels, or computes like OpenGL, simultaneously, but that would likely need hardware changes since the scheduler would be more complex. More likely, if such features were added, each SM would run its own kernel and you still wouldn’t split shared memory among differing kernels.)

Actually, I think it’d be possible to run different kernels on the same SM. You can emulate it easily by putting the two sets of machine code side-by-side in memory, and putting an if() in front.

But it doesn’t happen now. When it does, it will have several benefits. Performance could be better since running kernels with complementary load patterns improves the utilization of resources. And we won’t need the watchdog timer.

As someone above said, do the profiling thing and make sure your kernel executes the same way during every invocation…

If you are sure it does, it’s a problem.

You could do 2 things:

  1. File a problem report (start a new thread with the data required for the problem report).
  2. Sign up as a registered developer and officially file a report to NVIDIA directly.
  1. I don’t think it’s a problem with CUDA itself; it appears to be a problem with how I’m using the Driver API. Thorough testing has indicated this happens with ALL of my kernels (…eventually). After it happens, it seems to halve the performance of ALL CUDA and even OpenGL related operations (not only kernel calls, but simple device->device, host->device, and device->host memory copies, and even OpenGL rendering times) for my process.

Note: Starting a new process (keeping the old one active) yields expected performance levels from BOTH processes again, but eventually they both suffer from the same performance degradations after a while.

  2. I’ve tried to register at least 2 times before, and never had a response.

So how does everyone else use the Driver API? Do they create a new context for each kernel call, or create a single context per device on startup, or something else entirely? (As I think I said before, I’m creating one context on startup and using it for ALL kernel calls throughout the application.)
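For reference, the pattern I understand the SDK samples to use is one context per device, created once and kept for the life of the process, with device buffers allocated once and reused rather than reallocated per call. A rough sketch of that shape (Driver API; names are illustrative, error checking omitted):

```cpp
// One context per device, created once; buffers allocated once and reused.
CUdevice  dev;
CUcontext ctx;
cuInit(0);
cuDeviceGet(&dev, 0);
cuCtxCreate(&ctx, 0, dev);      // stays current on this thread from now on

CUdeviceptr d_buf;
cuMemAlloc(&d_buf, bytes);      // allocate once, outside the call loop

for (;;) {                      // per-call / per-frame work only
    cuMemcpyHtoD(d_buf, h_in, bytes);
    cuLaunchGrid(fn, gridW, gridH);
    cuCtxSynchronize();
}

cuMemFree(d_buf);
cuCtxDestroy(ctx);
```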

Again, any insight into what ‘might’ cause this behavior would be greatly appreciated.

That makes it sound even more like it’s a problem with CUDA.

Make a simple app that reproduces the problem, I’ll see if I get the same results and someone from NVIDIA can take a look at it.

I think you’re right - I managed to reproduce the same problem with nVidia’s “simpleTextureDrv” sample (the only sample that uses the Driver API, afaik), after some slight modifications that allow me to run the kernel multiple times without restarting the application.

At first, the program takes ~2.2ms to complete its copy; after a while (to speed up the process, I just alt-tabbed, opened up some other apps, and let the sample app idle for about 30-60 seconds) the time jumped up to ~8ms.

[codebox]Processing time: 2.237365 (ms)
117.17 Mpixels/sec

** Press any key to run again, or Escape to exit **

Processing time: 8.021899 (ms)
32.68 Mpixels/sec

** Press any key to run again, or Escape to exit **[/codebox]

See attached for modified sample code, with VS 2008 (VS9) solution. Simply copy it into the CUDA SDK’s “projects” directory - (copy over the original “simpleTextureDrv” project files into the new directory if you want to use 2005) & run the program.

It’ll re-run the kernel each time you press a key. I was getting ~2ms (on my Quadro FX 570) consistently for a while, then alt-tabbed to Firefox (I was about to post that I couldn’t reproduce the problem), went back to the sample app, pressed a key to re-run the kernel, and got 8ms. (It probably takes ~30-60 seconds of idle time.)

Note: Holding down any key to constantly run the kernel, eventually crashes after ~10 seconds with “The program ‘[2520] simpleTextureDrv.exe: Native’ has exited with code 1 (0x1).”


It’s happening to me too. Exactly how you described it. After a few minutes performance drops down 2x. I open another instance of the app, and the first instance goes back to normal. Another few minutes, and both are 2x slow again. (Third instance fixes everything again for a bit, and so on.)

Vista x64
CUDA 2.0

Can anyone else confirm this? (I’d think 2 would be enough, but just in case - preferably someone running XP or Linux.)

Even better - can nVidia acknowledge/confirm this?

Just to get a quick understanding:

Is this problem only with the Driver API - the context thing that you were mentioning earlier?

Is it a problem with all kernels?


I’m not sure if it’s Driver API specific - I do know it applies to any/all kernels, though (all of mine, plus nVidia’s Driver API sample, somewhat prove that, although not conclusively).

It could very well be a side-effect of having a ‘stale’ context, I’ve not tested that theory.

Hmm… I have never used a context… So, I would like to know how my kernels are faring… Thanks for the useful input!

If I get time to test it, I will post my result here. Thanks

Doesn’t happen on an 8800GT with XP 32. I left the program idle for 10 minutes, but still get the same speed, between 880 and 910 Mpixels/s.

The only thing I did get was the crashing after holding down a key for 10 seconds.

So maybe it’s Vista/64 related?

Indeed it would appear this is a Vista issue (both 32bit and 64bit), I’ve managed to confirm it on other Vista boxes - but I won’t be able to get XP installed till after the Christmas break to confirm that it’s not an issue on my end.

Sadly this doesn’t seem to be directly related to being idle - I’m noticing very similar symptoms after extended execution in a real-time system constantly calling CUDA kernels (~30+ kernels a frame, so it’s certainly not idle)… My application starts off perfectly for 10+ seconds, with performance degradation in memory transfers coming shortly after, and then in kernel execution as well not long after that.

I didn’t notice this until now simply because of the difficulty involved in properly profiling the application…

Can anyone from nVidia confirm this issue in Vista (from my example posted above)?

It’s been over a month now, and no response from nVidia… what’s going on guys?

It’s always good to dredge up an old thread. :)

I’m seeing the same type of behavior with a GTX 260 under 64-bit Linux with a program written using the Driver API. One question for those of you already here: what happens under the PowerMizer settings in NVIDIA’s Settings program? In my case, as execution continues, it moves from Performance Level 2 (576 MHz core clock, 999 MHz memory clock) to Level 1 (400 MHz core / 300 MHz memory), then after some more time moves to Level 0 (300 MHz core / 100 MHz memory). When I kill the program and start it again, it starts back at Level 2.

Interestingly enough, this didn’t seem to happen on a 9800 GX2 under 64-bit Linux (at least when computing on a single core).

…Well, it doesn’t happen on the GX2 because there is only one performance level.

Is there an open bug about this behavior?