I have a kernel which fails due to an illegal memory access and I want to recover from this failure.
After the kernel fails, I try to destroy the old context with:
cuDevicePrimaryCtxReset
Then I try to create a new context with either:
cuDevicePrimaryCtxRetain
or
cuCtxCreate
The problem seems to be that the error from the failed kernel launch has not been cleared, and creating a new context fails with that same old error.
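In code, the sequence I am attempting looks roughly like this (a simplified sketch, not my actual code; setup and error handling are trimmed):

```c
#include <cuda.h>
#include <stdio.h>

int main(void)
{
    CUdevice dev;
    CUcontext ctx;

    cuInit(0);
    cuDeviceGet(&dev, 0);

    /* ... a context is created, a kernel is launched and faults
     * with an illegal memory access ... */

    /* Destroy the old (broken) context: */
    CUresult res = cuDevicePrimaryCtxReset(dev);

    /* Then try to get a new context, either: */
    res = cuDevicePrimaryCtxRetain(&ctx, dev);
    /* or: res = cuCtxCreate(&ctx, 0, dev); */

    /* Problem: res still reports CUDA_ERROR_ILLEGAL_ADDRESS from the
     * failed launch instead of CUDA_SUCCESS. */
    printf("res = %d\n", (int)res);
    return 0;
}
```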
Is it possible to recover from a broken context?
nvidia-smi tells me that I am using driver version: 375.66
OS: Ubuntu
nvcc --version: Cuda compilation tools, release 8.0, V8.0.44
It’s possible, yes. For the CUDA runtime API, cudaDeviceReset() will do the trick.
I would have to research the exact method in the driver API.
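For illustration, a minimal runtime-API sketch (the faulting kernel here is just a stand-in):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel that triggers an illegal memory access.
__global__ void bad_kernel(int *p) { *p = 42; }

int main()
{
    bad_kernel<<<1, 1>>>(nullptr);
    cudaError_t err = cudaDeviceSynchronize();
    printf("after fault: %s\n", cudaGetErrorString(err)); // illegal memory access

    // Destroys the broken context (and all of its allocations).
    cudaDeviceReset();

    // The next runtime call lazily creates a fresh context.
    int *d = nullptr;
    err = cudaMalloc(&d, sizeof(int));
    printf("after reset: %s\n", cudaGetErrorString(err)); // expect cudaSuccess
    cudaFree(d);
    return 0;
}
```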
Also, I would recommend updating your CUDA 8.0.44 install to the latest CUDA 8.0.61, just on general principle; it's not connected to your inquiry.
Thank you for the reply, txbob.
I have now tried a few more approaches to reset to a new context, but none have worked so far.
- I have tried to carefully manage the push/pop of the context and then make sure it is deleted after use or after a failure (see the sketch after this list).
- I have tried to use the primary context instead of creating a context.
- I have tried calling cuDevicePrimaryCtxReset in different locations and multiple times.
- I have tried running the CUDA code in a single thread.
- I have tried deleting every context I could find.
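For reference, the push/pop management from the first bullet looks roughly like this (simplified; my real code checks every return value):

```c
#include <cuda.h>

/* Simplified sketch of the context management I tried;
 * the kernel launch itself is elided. */
static void run_and_cleanup(CUdevice dev)
{
    CUcontext ctx;

    cuCtxCreate(&ctx, 0, dev);    /* creates the context and makes it current */
    cuCtxPopCurrent(NULL);        /* detach it from this thread again */

    cuCtxPushCurrent(ctx);        /* push around each use */
    /* ... launch the kernel here; it faults with an illegal access ... */
    cuCtxPopCurrent(NULL);

    cuCtxDestroy(ctx);            /* delete the context after use / failure */
    cuDevicePrimaryCtxReset(dev); /* also tried in various places and orders */
}
```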
Nothing seems to work, and I have so far not been able to find any useful hints in the documentation.
Edit: I tried calling cudaDeviceReset, but it did not work. I will try changing my code to use only the runtime API and then test cudaDeviceReset again.