Clearing CUDA Errors

I’m working on a program which has long runs and calls a CUDA kernel many times. I, like a couple other people who have posted, get occasional (and randomly occurring) “unspecified launch failure” errors. Based upon those previous posts, I’m thinking it’s due to the fault tolerances of my specific GPU.

Rather than kill the whole process, I’d like to be able to clear the error after it occurs, wait a few seconds, and try again. However, any subsequent error check fails due to the first, and none of the documentation or forum discussion I’ve found discusses clearing CUDA errors. Is there a way to clear CUDA errors?

Other options include:
(1) starting a separate process that calls the CUDA kernel, and checking its return code
(2) checking the returned data directly
Neither is ideal, because of the computational overhead.
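For reference, option (1) could look roughly like the sketch below on a POSIX system. This is only an illustration under assumptions: `run_isolated` and `run_with_retry` are hypothetical helper names, and `work` stands in for whatever function initializes CUDA, launches the kernel, and returns 0 on success. The point is that any GPU context created inside the child goes away when the child exits, so a failed launch does not poison the parent.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: runs `work` in a forked child process and returns the
// child's exit code, or -1 if fork/waitpid failed or the child died on a signal.
int run_isolated(const std::function<int()>& work) {
    pid_t pid = fork();
    if (pid < 0) {
        return -1;  // fork failed
    }
    if (pid == 0) {
        // Child: the CUDA work would happen here (assumption: `work` does all
        // CUDA initialization itself, so nothing is inherited from the parent).
        _exit(work());
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

// Hypothetical helper: tries `work` up to `max_attempts` times in fresh
// processes, pausing between failed attempts. Returns true on first success.
bool run_with_retry(const std::function<int()>& work, int max_attempts,
                    std::chrono::milliseconds pause) {
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        if (run_isolated(work) == 0) {
            return true;
        }
        if (attempt + 1 < max_attempts) {
            std::this_thread::sleep_for(pause);
        }
    }
    return false;
}
```

A call such as `run_with_retry(launch_once, 3, std::chrono::seconds(5))` (with `launch_once` being your own function) would then give the "wait a few seconds and try again" behavior. Two caveats: the parent should not have touched CUDA before calling `fork()`, since a CUDA context is not usable in a forked child, and the results still have to be passed back to the parent somehow (a pipe, a file, or shared memory), which is where the overhead mentioned above comes from.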

Any thoughts? Thanks…

Hi,

I have exactly the same problem. I have tried cudaThreadExit, and I have also changed the code to run with the driver API and use cuCtxDestroy etc., all to no avail.

In the driver API version I also find that cuMemFree does not work properly: it returns a CUDA_ERROR_INVALID_VALUE error, and shortly thereafter I start getting CUDA_ERROR_OUT_OF_MEMORY errors.

I would appreciate it if someone could assist us in this regard. Thanks in advance.

To clear the error, first find the exact cause and location of the first error. Then you can explicitly reset the error to cudaSuccess. Surround each kernel call with calls to cudaGetLastError. This returns the error from the last operation performed and, as a side effect, resets the error code to cudaSuccess. HTH.

My problem is not getting rid of the error code, but getting the GPU working again. Once you get something like Error 4 (unspecified launch failure), you need to stop and exit the host program. When you run it again, the GPU responds fine. In my application I do a lot of iterations with varying parameters. Every now and then the GPU executes longer than the timeout period and triggers Error 4. I would like to reset the GPU environment completely without exiting the host program, and then carry on with the rest of the algorithm. As mentioned, I have had no success so far. I might just mention that the default CUDA handlers (like the SAFE_CALL variety) perform an exit, and thus this problem is not obvious.

I am seeing the same error: I use the GPU for up to 4-5 hours flawlessly and then I get “unspecified launch failure.” I do not, and cannot, restart the host application, as this is an API/SDK library that we are developing.

Any thoughts?

I am using CUDA mostly from MATLAB, and when I detect a CUDA error in MATLAB, I clear the mex file, which is like unloading a DLL. Maybe if your host program loads the CUDA part from a .dll/.so, you can detect the error in the host part, unload the .dll/.so, and reload it? I have no idea whether that works, since I have never dealt with DLLs myself.

Unfortunately, the host program would be our customer's, so unloading the .so file would not be an option.
I’m wondering if it is a driver issue. I’m using the beta driver for Ubuntu 9.04 on Ubuntu 9.10. Is a non-beta driver going to be released?