I have a kernel which I call repeatedly (indefinitely, until the program terminates). For testing purposes, to try to diagnose what's causing this performance problem, I've made it so this kernel gets identical inputs each time, and I've checked that it produces identical outputs each time…
On average, this kernel takes ~4 ms to execute. However, after an arbitrary number of executions (it's happened after 2-3, sometimes after 30-1000+, sometimes never), its execution time jumps to 9-12 ms (again, with identical inputs/outputs and no errors), and it consistently runs at this 2-3× slower speed for the rest of the process's lifespan.
To reiterate: the code paths taken in both the kernel and the CPU (Driver API) code are identical, and receive/output identical values each time I call this function - all running in the same CPU thread and on the same CUDA device.
I should note what happens each time I call this kernel: I push a context (which I create at the beginning of the program), allocate new memory, copy the memory from host to device, synchronize the async memcpys, start the event timer, launch the kernel, synchronize, stop the event timer, clean up the memory, and pop the context. Note: the timings I'm referring to are between the start and stop events.
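For reference, the per-call sequence above looks roughly like the following Driver API pseudocode (a sketch only, not compilable as-is: error checking and kernel argument setup are omitted, and the grid dimensions are placeholders):

```
cuCtxPushCurrent(ctx);                    // ctx was created once at startup
cuMemAlloc(&devIn,  inBytes);             // fresh allocations every call
cuMemAlloc(&devOut, outBytes);
cuMemcpyHtoDAsync(devIn, hostIn, inBytes, 0);
cuCtxSynchronize();                       // wait for the async copies
cuEventRecord(startEvent, 0);             // timed region starts here
cuLaunchGrid(kernelFunc, gridW, gridH);
cuCtxSynchronize();                       // wait for the kernel
cuEventRecord(stopEvent, 0);              // timed region ends here
cuEventSynchronize(stopEvent);
cuEventElapsedTime(&ms, startEvent, stopEvent);  // the ~4 ms (later ~9-12 ms) figure
cuMemcpyDtoH(hostOut, devOut, outBytes);
cuMemFree(devIn);
cuMemFree(devOut);
cuCtxPopCurrent(NULL);
```

The point being: only the event-to-event interval is measured, so the allocation/copy/teardown overhead around it shouldn't be part of the reported number.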
My question is, what can be causing these unpredictable performance degradations? And more importantly, what can I do to avoid them entirely?
CUDA Version: 2.0
Operating System: Windows Vista (32bit)
Card: Quadro FX 570
Driver Version: 177.84
P.S. I'm also calling other kernels both before and after this kernel - however, I synchronize both before and after it, so I can't see how those other kernels could affect this kernel's measured performance.
I am - however, there's nothing but Visual Studio running in this case. (I even tried disabling Aero, with the same results; not that I'd expect something as basic as Aero to interfere with kernel timings.) Additionally, I don't see this behavior in any of my other kernels (although none of them are as time-consuming as this one).
I’ve also managed to isolate the test case, so I’m now only running one kernel (nothing before/after it) - I still see identical results.
The kernel uses 0 bytes of local memory, 5,840 bytes of shared memory, and 11 registers.
Edit: Would I be right in assuming that if some other application were using the GPU via OpenGL/D3D, it could be consuming shared memory - thus allowing only 1 block to run at a time (if it used enough smem), as opposed to the 2 blocks that fit when the full 16 KB of smem is free?
Second edit: My assumption clearly isn't correct. I managed to bring the smem usage down to 1,508 bytes (allowing more than 2 blocks at a time, though this introduced 4-way bank conflicts, bringing the time up to ~5 ms), and it exhibits identical behaviour - after a few iterations, it jumps up to ~12 ms.
Have you tried timing the various pieces of what you're doing individually? I don't know exactly how the CUDA timer works, so I'd suggest using Windows' QueryPerformanceCounter() and throwing in a lot of cudaThreadSynchronize() calls everywhere. It could be some sort of shortcoming in any one of the CUDA calls you make. The kernel itself shouldn't slow down, and other 3D apps certainly can't steal smem (although they can slow down overall execution by a lot).
By the way, have you tried running the Visual Profiler to see if any of the counters change? Maybe the kernel isn't really doing the same work every iteration.
OpenGL/D3D certainly do use shared memory, but it's NOT used simultaneously with CUDA… like a CUDA kernel invocation, graphics computations get ALL 16 KB of shared memory to themselves during their compute. Shared-memory splitting only happens in CUDA, between blocks of the same kernel. (At least currently… it's possible that a later CUDA would allow mixing multiple kernels, or computes like OpenGL, simultaneously, but that would likely need hardware changes, since the scheduler would have to be more complex. More likely, if such features were added, each SM would run its own kernel, and you still wouldn't split shared memory among differing kernels.)
Actually, I think it’d be possible to run different kernels on the same SM. You can emulate it easily by putting the two sets of machine code side-by-side in memory, and putting an if() in front.
But it doesn't happen now. When it does, it will have several benefits: performance could improve, since running kernels with complementary load patterns improves resource utilization, and we won't need the watchdog timer.
I don't think it's a problem with CUDA; it appears to be a problem with how I'm using the Driver API. Thorough testing has indicated this happens with ALL of my kernels (…eventually). After it happens, it seems to halve ALL CUDA- and even OpenGL-related performance for my process - not only kernel calls, but also simple device->device, host->device, and device->host memory copies, and even OpenGL rendering times.
Note: starting a new process (keeping the old one active) yields expected performance levels from BOTH processes again, but eventually they both suffer the same performance degradation after a while.
I've tried to register at least twice before and never had a response.
So how does everyone else use the Driver API? Do you create a new context for each kernel call, create a single context per device on startup, or something else entirely? (As I think I said before, I'm creating one context on startup and using it for ALL kernel calls throughout the application.)
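For what it's worth, the common Driver API pattern is the one you're already using: one context per device, created once at startup and reused for every call. As pseudocode (sketch only; error checks omitted):

```
cuInit(0);
cuDeviceGet(&dev, 0);
cuCtxCreate(&ctx, 0, dev);   // once, at startup; ctx becomes current on this thread
/* ... all memcpys and kernel launches for the process reuse ctx ... */
cuCtxDestroy(ctx);           // once, at shutdown
```

One side note: if only a single CPU thread ever touches CUDA, you don't need the per-call cuCtxPushCurrent()/cuCtxPopCurrent() pair at all, since cuCtxCreate() leaves the context current on the creating thread.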
Again, any insight into what might cause this behavior would be greatly appreciated.
I think you're right - I managed to reproduce the same problem with NVIDIA's "simpleTextureDrv" sample (the only SDK sample that uses the Driver API, AFAIK), after some slight modifications that allow me to run the kernel multiple times without restarting the application.
At first, the program takes ~2 ms to complete its copy. After a while (to speed up the process, I just alt-tabbed, opened up some other apps, and let the sample app idle for about 30-60 seconds), the time jumped to ~8 ms.
[codebox]Processing time: 2.237365 (ms)
** Press any key to run again, or Escape to exit **
Processing time: 8.021899 (ms)
** Press any key to run again, or Escape to exit **[/codebox]
See the attachment for the modified sample code, with a VS 2008 (VS9) solution. Simply copy it into the CUDA SDK's "projects" directory (copy over the original "simpleTextureDrv" project files into the new directory if you want to use VS 2005) and run the program.
It'll re-run the kernel each time you press a key. I was getting ~2 ms (on my Quadro FX 570) consistently for a while, then alt-tabbed to Firefox (I was about to post that I couldn't reproduce the problem), went back to the sample app, pressed a key to re-run the kernel, and got 8 ms. (It probably takes ~30-60 seconds of idle time.)
Note: holding down a key to run the kernel continuously eventually crashes after ~10 seconds with "The program 'simpleTextureDrv.exe: Native' has exited with code 1 (0x1)."
It’s happening to me too. Exactly how you described it. After a few minutes performance drops down 2x. I open another instance of the app, and the first instance goes back to normal. Another few minutes, and both are 2x slow again. (Third instance fixes everything again for a bit, and so on.)
Indeed, it would appear this is a Vista issue (both 32-bit and 64-bit); I've managed to confirm it on other Vista boxes, but I won't be able to get XP installed until after the Christmas break to confirm that it's not an issue on my end.
Sadly, this doesn't seem to be directly related to being idle. I'm noticing very similar symptoms after extended execution in a real-time system that constantly calls CUDA kernels (~30+ kernels a frame), so it's certainly not idle… My application starts off perfectly for 10+ seconds, with performance degradation in memory transfers coming shortly after, then kernel execution degrading as well not long after that.
I didn’t notice this until now simply because of the difficulty involved in properly profiling the application…
Can anyone from nVidia confirm this issue in Vista (from my example posted above)?
I'm seeing the same type of behavior with a GTX 260 under 64-bit Linux, with a program written using the Driver API. One question for you: what happens under the PowerMizer settings in NVIDIA's Settings program? In my case, as execution continues, the card moves from Performance Level 2 (576 MHz core clock, 999 MHz memory clock) to Level 1 (400 MHz core / 300 MHz memory), then after some more time moves to Level 0 (300 MHz core / 100 MHz memory). When I kill the program and start it again, it starts back at Level 2.
Interestingly enough, this didn’t seem to happen on a 9800 GX2 under 64-bit Linux (at least when computing on a single core).