How to recover CUDA after the display driver has crashed and recovered (caused by a CUDA crash)?

Hello:

I am using CUDA in a data processing application. I would like to keep the software running even if CUDA crashes.

I use cudaSetDevice to initialize the GPU, and my kernel sometimes crashes due to a “time out exception”. Then the screen turns black and the display driver recovers after a few seconds.

What I did was catch the exception, and in the catch clause I called cudaDeviceReset, which returned success. However, I found that no CUDA API call succeeds afterwards.

I also tried another case: I initialized the GPU using cuCtxCreate, and in the catch clause I used cuCtxDestroy to destroy the crashed context. However, when I called cuCtxCreate again, it returned CUDA_ERROR_UNKNOWN. I have checked the CUDA compute mode; it was “CU_COMPUTEMODE_DEFAULT”.

Why is this? Can anyone help me? Thanks very much.

in addition to “time out exception”, your kernel may equally crash simply due to a bug

on occasion, the bug may be so severe that the only way to recover the device is to physically reset it (reboot or suspend/sleep); in such cases, cudaDeviceReset, etc. will not save you

you need to rule out the possibility that your kernel contains a major bug that puts the device into genuinely undefined behaviour

Hi little_jimmy:

Thanks for your reply. Yes I agree with you that my kernel may contain a major bug. That is what I should avoid as much as possible.

You mentioned that the bug may be so severe that the device needs to be physically reset. In my case, after the display driver automatically recovered from the black screen, if I re-run my program, the CUDA API works again, even though I didn't physically reset the device.

So in this case, is it possible to make CUDA work again without re-running my program? I could switch to equivalent CPU code for the part that crashed, and continue to use CUDA for the subsequent processing. Or, like you said, will cudaDeviceReset still not save me?

Thank you.
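
For what it's worth, the CPU fallback described here can be sketched in plain host-side C++ as below. This is only a sketch under assumptions: `FallbackRunner`, `Stage`, and `gpu_usable` are hypothetical names, not CUDA APIs, and the GPU path is passed in as a callable that is assumed to return false when any CUDA call inside it reports an error (anything other than `cudaSuccess`). Since no CUDA call succeeded after the crash within the same process, the sketch stops trying the GPU after the first failure instead of attempting to revive it.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical dispatcher: uses the GPU path until it fails once, then
// routes all later work to the CPU path for the rest of the process.
class FallbackRunner {
public:
    // A stage processes the data in place and returns true on success.
    using Stage = std::function<bool(std::vector<float>&)>;

    FallbackRunner(Stage gpu, Stage cpu)
        : gpu_(std::move(gpu)), cpu_(std::move(cpu)) {
        assert(cpu_ && "a CPU path is required");
    }

    bool run(std::vector<float>& data) {
        if (gpu_ok_ && gpu_) {
            // Work on a copy so a failed GPU attempt cannot leave the
            // input half-modified before the CPU path sees it.
            std::vector<float> scratch = data;
            if (gpu_(scratch)) {
                data.swap(scratch);
                return true;
            }
            gpu_ok_ = false;  // assumption: the GPU stays unusable in this process
        }
        return cpu_(data);
    }

    bool gpu_usable() const { return gpu_ok_; }

private:
    Stage gpu_;
    Stage cpu_;
    bool gpu_ok_ = true;
};
```

In a real program the GPU callable would wrap the kernel launch plus its error checks, and the CPU callable would be the equivalent CPU code mentioned above.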

“if I re-run my program, the CUDA API works again, even though I didn’t physically reset the device.”

it may work, but is it really working…?

i have come to adopt the rule of thumb that, if my program crashes, i reset the device, particularly while still debugging, as, on occasion, the device recovers, but i no longer obtain the same results - only when i reset does everything really go back to normal

and i think that the philosophy of prevention being better than cure really applies here; the more i debug my code, the greater its reliability and stability, such that crashes no longer occur and device ‘recovery’ is hardly needed
it is a good thing if your code crashes, as it forces you to re-examine it

Hi little_jimmy:

In my case I still obtain the same results if I successfully reset the cuda device. I wonder why it could be different.

And thank you for the advice. It’s always better to kill the bugs first. Sometimes it’s hard for me to debug because I don’t have the customers’ image data from when they experienced the CUDA crashes. At the moment I just switch the processing to the CPU, because I can’t exit the program: the processing may take many days.

Many thanks.

“In my case I still obtain the same results if I successfully reset the cuda device. I wonder why it could be different.”

it is very likely application-specific

if you can detect when/whether a crash occurs, you should be able to pre-collect or post-collect user data for debugging purposes, provided that the data is not too big

pre-collect: temporarily store the user input and either dump it or discard it, depending on whether a crash occurred
post-collect: when a crash occurs, request again the same user input that caused it

it seems as if you are able to detect when/whether a crash occurs
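
the pre-collect idea could look roughly like this in host-side C++. it is a sketch under assumptions: `run_and_dump_on_failure` and the dump path are hypothetical names, `process` stands in for whatever wraps the CUDA work and is assumed to return false when a crash is detected, and the whole thing only helps if the host process itself survives the crash, as it appears to in this thread.

```cpp
#include <fstream>
#include <functional>
#include <ios>
#include <string>
#include <vector>

// Hypothetical helper: runs `process` on `input`. If processing reports
// failure, the raw input bytes are written to `dump_path` for later
// debugging; on success nothing is written. Returns the result of `process`.
bool run_and_dump_on_failure(
    const std::vector<unsigned char>& input,
    const std::function<bool(const std::vector<unsigned char>&)>& process,
    const std::string& dump_path) {
    if (process(input)) {
        return true;  // no crash: the input is simply discarded by the caller
    }
    // Crash detected: keep the input that triggered it.
    std::ofstream out(dump_path, std::ios::binary | std::ios::trunc);
    out.write(reinterpret_cast<const char*>(input.data()),
              static_cast<std::streamsize>(input.size()));
    return false;
}
```

the file then exists only for inputs that actually failed, so it can be sent back for debugging without keeping every input around.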

Did you (original poster) turn off the “watchdog” (the display driver timeout that kills long-running kernels)?