CUDA error 30 (Unknown error) after millions of kernel launches

Hi there,

I have a program that processes video, which fails with CUDA error 30 (“Unknown error”) after several million frames. Although that’s a lot of video, the failure rate is still unacceptable when spread across many machines.

I’ve seen from these forums that error 30 is associated with memory-addressing issues, e.g. out-of-bounds access to shared memory, or to global memory bound to a texture. So I’ve turned off all my kernels, leaving just the cudaMalloc/cudaFree calls, the memcpys, the texture binding, and the code required to create and use streams. I still get the same error, just after more iterations.

I’ve run the whole program (slowly) in the Nsight debugger with memory checking turned on. This didn’t turn up any issues, at least over the few thousand iterations I ran. I’ve also put guard blocks around all my host and device memory allocations, and they’re not getting corrupted.

Do you have any suggestions for what I might try next?


[Software - Windows 7 x64, CUDA Toolkit 3.2.16, driver 275.33; Hardware - GTX460M, Core i7 Q720].

Try cuda-memcheck, and try CUDA 4.0 and CUDA 4.1. Also, does it fail only after millions of frames in a single continuous run, or if you launch the program 1000 times with 1000 frames each, will it fail once? Maybe there is a memory leak in the SDK, and too many kernel launches etc. eventually cause a failure.

It seems like you have already spent time to convert the full app into a simplified version that basically constitutes a driver stress test. I have personally successfully run applications with tens of millions of kernel launches on Windows, but these weren’t using as many CUDA features as your app is exercising. It is possible that the more comprehensive set of features exercised by your app exposes a driver problem that previously eluded capture during testing.

Any issue that existed with CUDA 3.2 and the accompanying driver set may have already been fixed by now. If possible, try upgrading your software stack to either the latest released version (CUDA 4.0), or the latest release candidate (CUDA 4.1 RC1, available to registered developers). If the problem still reproduces with the stripped-down version of the code on either CUDA 4.0 or CUDA 4.1 RC1, I would suggest filing a bug, attaching the stripped down code.

Hi Lev, thanks for your reply.

I actually launch the program once, and run a cycle of “configure, play N frames, stop”.

Any thoughts on how to test your idea of a memory leak in the SDK?



Hi njuffa, thanks for your reply. It’s good to know that you’ve successfully tested tens of millions of kernel launches.

I will try CUDA 4.1 RC1, I think.

A couple of points I forgot to mention:

  1. When I get the “error 30”, sometimes I also get corruption of the screen display (this is on a laptop, so only one adapter).

  2. To clear the error, I have found it necessary to shut down Windows completely, and then power up. If I just do a Restart (from the Start menu), the “error 30” will happen much sooner.

I’d be interested to hear if you’ve anything to add.



It may also be a problem with memory management in your own program: somehow CUDA ends up with corrupted memory. Try allocating the memory just once, up front.

I have no experience with it myself, but could it be related to this problem? Does your notebook have a Fermi-class GPU?

But since the CUDA kernel code has been removed here, that seems less likely. Still, anything is possible.

Hi tera,

Thanks for bringing that thread to my attention. I’ve only had a quick look at it so far, but there’s one similarity: I see the GPU go into an “error” state that can only be cleared by cycling the power (at least when my test includes running the kernels). This is running on a Fermi GPU.

If I don’t run the kernels, and just do the memcpys etc., it takes longer to fail. I haven’t yet checked whether there is a persistent error state in that case (I’ve got into the habit of cycling the power before every test, to ensure stable initial conditions).

My stress test with kernels also fails on a GTX285M (G92b architecture), but I haven’t seen display corruption or a persistent error state. There is perhaps more than one thing going on here. I can’t rule out the possibility that there’s something wrong with my CPU code, as my test app is still pretty complicated. Perhaps the Fermi driver doesn’t handle bad input quite so well as the older driver.