I am using CUDA in a data processing application. I would like to keep the application running even if CUDA crashes.
I use cudaSetDevice to initialize the GPU, and my kernel sometimes crashes with a “time out exception”. The screen then turns black and the display driver recovers after a few seconds.
What I did was catch the exception and call cudaResetDevice in the catch clause, which returned success. However, I found that no CUDA API call succeeds afterwards.
I also tried another approach: I initialize the GPU with cuCtxCreate, and in the catch clause I call cuCtxDestroy to destroy the crashed context. However, when I call cuCtxCreate again, it returns CUDA_ERROR_UNKNOWN. I have checked the CUDA compute mode, and it is “CU_COMPUTEMODE_DEFAULT”.
Why does this happen? Can anyone help? Thanks very much.
in addition to “time out exception”, your kernel may equally crash simply due to a bug
on occasion, the bug may be so severe that the only way to recover the device is to physically reset it - reboot or suspend/ sleep; in such cases, cudaResetDevice, etc will not save you
you need to rule out the possibility that your kernel contains a major bug that puts the device into genuinely undefined behaviour
Thanks for your reply. Yes I agree with you that my kernel may contain a major bug. That is what I should avoid as much as possible.
You mentioned that the bug may be too severe and that the device may need to be physically reset. In my case, after the display driver automatically recovers from the black screen, the CUDA API works again if I re-run my program, even though I didn't reset the device physically.
So in this case, is it possible to make CUDA work again without re-running my program? I could switch to equivalent CPU code for the part that crashed, and continue to use CUDA for the subsequent processing. Or, as you said, will cudaResetDevice still not save me?
Thank you.
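A minimal host-side sketch of the fallback idea described above, with no CUDA dependency: each processing stage has a GPU implementation and an equivalent CPU implementation, and once the GPU stage reports a failure the program stays on the CPU. The names `Stage`, `Pipeline`, `gpu_stage`, `cpu_stage` and `gpu_usable` are hypothetical and only for illustration. It assumes the GPU stage signals a crash by returning false, and it does not show how to make CUDA usable again inside the same process.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical signature for one processing stage: fills `out` from `in`
// and returns true on success. A real GPU stage would return false when
// any CUDA call it makes reports an error.
using Stage =
    std::function<bool(const std::vector<float>&, std::vector<float>&)>;

struct Pipeline {
    Stage gpu_stage;         // may be empty if no GPU path exists
    Stage cpu_stage;         // required: the equivalent CPU code
    bool gpu_usable = true;  // cleared after the first GPU failure

    // Runs one stage and returns true if a result was produced.
    bool run(const std::vector<float>& in, std::vector<float>& out) {
        assert(cpu_stage && "a CPU fallback is required");
        if (gpu_usable && gpu_stage) {
            if (gpu_stage(in, out)) {
                return true;
            }
            // Assumption: after a failure, later GPU calls in this
            // process would fail too, so stop trying the GPU.
            gpu_usable = false;
            out.clear();  // drop any partial GPU output
        }
        return cpu_stage(in, out);
    }
};
```

Keeping the "GPU failed" flag sticky matches the behaviour reported here, where no CUDA call succeeds after the crash; whether the flag can safely be cleared again later is exactly the open question in this thread.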
" if I re-run my program again, CUDA API will work again, actually I didn’t reset the device physically. "
it may work, but is it really working…?
i have come to adopt the rule of thumb that, if my program crashes, i reset the device, particularly when debugging still, as, on occasion, the device recovers, but i no longer obtain the same results - only when i reset does everything really go back to normal
and i think that the philosophy of prevention being better than cure really applies here; the more i debug my code, the greater its reliability and stability, such that crashes no longer occur, such that device ‘recovery’ is hardly needed
it is a good thing if your code crashes, as it forces you to re-examine it
In my case I still obtain the same results if I successfully reset the cuda device. I wonder why it could be different.
And thank you for the advice. It’s always better to kill the bugs first. Sometimes it’s hard for me to debug because I don’t have the customers’ image data from when they experienced the CUDA crashes. At the moment I just switch the processing to the CPU, and I can’t exit the program, because the processing may take many days.
“In my case I still obtain the same results if I successfully reset the cuda device. I wonder why it could be different.”
it is very likely application-specific
if you can detect when/ whether a crash occurs, you should be able to pre-collect or post-collect user-data for debugging purposes, provided that the data is not too big
pre-collect: temporarily store the user input and either dump it or discard it, depending on whether a crash occurred
post-collect: again request the same user input that causes a crash, when a crash occurs
it seems as if you are able to detect when/ whether a crash occurs
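The pre-collect idea could be sketched roughly as follows: keep a copy of the input before processing, write it to disk only if processing fails, and otherwise let it go out of scope. The function name `run_with_precollect`, the `process` callback and the `dump_path` parameter are hypothetical. The sketch assumes the input fits in memory and that the crash is reported as a false return value rather than taking down the whole process.

```cpp
#include <cassert>
#include <fstream>
#include <functional>
#include <string>
#include <vector>

// Runs `process` on `input` (in place). If it reports failure, the copy of
// the original input taken beforehand is written to `dump_path` so it can
// be collected for debugging; on success the copy is simply discarded.
// Returns whatever `process` returned.
inline bool run_with_precollect(
    std::vector<unsigned char> input,
    const std::function<bool(std::vector<unsigned char>&)>& process,
    const std::string& dump_path) {
    assert(process && "a processing callback is required");

    // Pre-collect: the processing step may overwrite its buffer, so keep
    // an untouched copy of what the user supplied.
    const std::vector<unsigned char> retained = input;

    const bool ok = process(input);
    if (!ok) {
        std::ofstream dump(dump_path, std::ios::binary);
        dump.write(reinterpret_cast<const char*>(retained.data()),
                   static_cast<std::streamsize>(retained.size()));
    }
    return ok;
}
```

For large images the in-memory copy may be too expensive; in that case the post-collect variant, asking for the same input again after a crash, may be the more practical choice.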