I have a program that processes video, which fails with CUDA error 30 (“Unknown error”) after several million frames. Although that’s a lot of video, the failure rate is still unacceptable when spread across many machines.
I’ve seen from these forums that error 30 is associated with memory addressing issues, e.g. out-of-bounds access to shared memory, or to global memory bound to a texture. So, I’ve turned off all my kernels, leaving just the CUDA mallocs, frees, memcpys, texture binding, and the code required to create and use streams. I still get the same error, but only after more iterations.
I’ve run the whole program (slowly) in the NSight debugger, with memory checking turned on. This didn’t show up any issues, at least on the few thousand iterations I ran. I’ve put guard blocks around all my host and device memory allocations, and they’re not getting corrupted.
Do you have any suggestions for what I might try next?
[Software - Windows 7 x64, CUDA Toolkit 3.2.16, driver 275.33; Hardware - GTX460M, Core i7 Q720].
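For readers unfamiliar with the guard-block technique mentioned above, here is a minimal sketch of how it might look for device allocations. It is not the poster's actual code: the helper names (guardedMalloc, guardsIntact, guardedFree), the 256-byte guard size, and the 0xA5 fill pattern are all assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Assumed values: 256 bytes keeps the user region at cudaMalloc's alignment.
static const size_t kGuardBytes = 256;
static const unsigned char kGuardByte = 0xA5;

// Hypothetical helper: allocates `bytes` of device memory with a guard region
// on each side. Returns a pointer to the usable region, or NULL on failure.
void* guardedMalloc(size_t bytes)
{
    unsigned char* base = 0;
    if (cudaMalloc((void**)&base, bytes + 2 * kGuardBytes) != cudaSuccess)
        return 0;
    if (cudaMemset(base, kGuardByte, kGuardBytes) != cudaSuccess ||
        cudaMemset(base + kGuardBytes + bytes, kGuardByte, kGuardBytes) != cudaSuccess) {
        cudaFree(base);
        return 0;
    }
    return base + kGuardBytes;
}

// Hypothetical helper: copies both guards back to the host and returns true
// only if every guard byte still holds the fill pattern.
bool guardsIntact(const void* user, size_t bytes)
{
    const unsigned char* base = (const unsigned char*)user - kGuardBytes;
    std::vector<unsigned char> guards(2 * kGuardBytes);
    if (cudaMemcpy(&guards[0], base, kGuardBytes,
                   cudaMemcpyDeviceToHost) != cudaSuccess)
        return false;
    if (cudaMemcpy(&guards[kGuardBytes], base + kGuardBytes + bytes, kGuardBytes,
                   cudaMemcpyDeviceToHost) != cudaSuccess)
        return false;
    for (size_t i = 0; i < guards.size(); ++i)
        if (guards[i] != kGuardByte) return false;
    return true;
}

// Hypothetical helper: frees a pointer returned by guardedMalloc.
void guardedFree(void* user)
{
    if (user) cudaFree((unsigned char*)user - kGuardBytes);
}

int main()
{
    const size_t bytes = 1024;
    void* buf = guardedMalloc(bytes);
    if (!buf) { std::fprintf(stderr, "allocation failed\n"); return 1; }

    std::printf("before overrun: guards %s\n",
                guardsIntact(buf, bytes) ? "intact" : "corrupted");

    // Deliberately write one byte past the end of the usable region.
    cudaMemset(buf, 0, bytes + 1);

    std::printf("after overrun:  guards %s\n",
                guardsIntact(buf, bytes) ? "intact" : "corrupted");

    guardedFree(buf);
    return 0;
}
```

Guards like these only catch writes that land inside the padding, so intact guards do not rule out a stray write that lands further away.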
Try cuda-memcheck, and also try CUDA 4.0 and CUDA 4.1. Also, does it fail only after millions of continuous frames in a single run, or would it also fail once in a while if you launched the program 1000 times with 1000 frames each? Maybe there is a memory leak in the SDK, and too many kernel launches etc. eventually cause a failure.
It seems like you have already spent time to convert the full app into a simplified version that basically constitutes a driver stress test. I have personally successfully run applications with tens of millions of kernel launches on Windows, but these weren’t using as many CUDA features as your app is exercising. It is possible that the more comprehensive set of features exercised by your app exposes a driver problem that previously eluded capture during testing.
Any issue that existed with CUDA 3.2 and the accompanying driver set may have already been fixed by now. If possible, try upgrading your software stack to either the latest released version (CUDA 4.0), or the latest release candidate (CUDA 4.1 RC1, available to registered developers). If the problem still reproduces with the stripped-down version of the code on either CUDA 4.0 or CUDA 4.1 RC1, I would suggest filing a bug, attaching the stripped-down code.
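A stripped-down reproduction of the kind discussed here might look roughly like the sketch below. It is only an illustration, not the original poster's code: the check and stress helpers, the buffer size, and the default iteration count are assumptions. It checks every runtime call and reports the iteration on which the first failure occurs. Texture binding is left out, because the texture-reference API of that era has been removed from recent toolkits.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: reports which call failed and on which iteration.
static bool check(cudaError_t err, const char* what, long iter)
{
    if (err == cudaSuccess) return true;
    std::fprintf(stderr, "iteration %ld: %s failed: %d (%s)\n",
                 iter, what, (int)err, cudaGetErrorString(err));
    return false;
}

// Hypothetical stress loop: malloc, async copy up, async copy down, sync, free.
// Returns the iteration that failed (0 if setup failed), or -1 on success.
long stress(long iterations, size_t bytes)
{
    cudaStream_t stream = 0;
    void* host = 0;
    long failed = -1;

    if (!check(cudaStreamCreate(&stream), "cudaStreamCreate", 0)) return 0;
    if (!check(cudaMallocHost(&host, bytes), "cudaMallocHost", 0)) {
        cudaStreamDestroy(stream);
        return 0;
    }
    std::memset(host, 0, bytes);

    for (long i = 0; i < iterations && failed < 0; ++i) {
        void* dev = 0;
        bool ok = check(cudaMalloc(&dev, bytes), "cudaMalloc", i)
               && check(cudaMemcpyAsync(dev, host, bytes,
                                        cudaMemcpyHostToDevice, stream),
                        "host-to-device copy", i)
               && check(cudaMemcpyAsync(host, dev, bytes,
                                        cudaMemcpyDeviceToHost, stream),
                        "device-to-host copy", i)
               && check(cudaStreamSynchronize(stream),
                        "cudaStreamSynchronize", i);
        // Always attempt the free so a failed iteration does not leak.
        if (dev) ok = check(cudaFree(dev), "cudaFree", i) && ok;
        if (!ok) failed = i;
    }

    cudaFreeHost(host);
    cudaStreamDestroy(stream);
    return failed;
}

int main(int argc, char** argv)
{
    // Iteration count comes from the command line; 1000 is an assumed default.
    long iterations = (argc > 1) ? std::atol(argv[1]) : 1000;
    long failed = stress(iterations, 1 << 20);   // 1 MiB buffers, assumed size
    if (failed >= 0) return 1;
    std::printf("%ld iterations completed without error\n", iterations);
    return 0;
}
```

Because the iteration count is a command-line argument, the same binary can be run once with a very large count or many times with a small one, which helps separate a failure that depends on one long-lived process from one that occurs at a fixed rate per operation.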
Thanks for bringing that thread to my attention. I’ve only had a quick look at it so far, but there’s one similarity: I see the GPU go into an “error” state that can only be cleared by cycling the power (at least when my test includes running the kernels). This is running on a Fermi GPU.
If I don’t run the kernels, and just do the memcpy’s etc, it takes longer to fail. I haven’t checked yet to see if there is a persistent error state (I’ve got into the habit of cycling the power before every test, to ensure stable initial conditions).
My stress test with kernels also fails on a GTX285M (G92b architecture), but I haven’t seen display corruption or a persistent error state. There is perhaps more than one thing going on here. I can’t rule out the possibility that there’s something wrong with my CPU code, as my test app is still pretty complicated. Perhaps the Fermi driver doesn’t handle bad input quite so well as the older driver.