Reset context after kernel failure?

I have a kernel which fails due to an illegal memory access and I want to recover from this failure.

After the kernel fails I try to destroy the old context with:

cuDevicePrimaryCtxReset

Then I try to create a new context with either:

cuDevicePrimaryCtxRetain

or

cuCtxCreate

The problem seems to be that the error from the failed kernel launch is sticky: it is never cleared, and the creation of a new context fails with the old error.
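For reference, the recovery sequence I am attempting looks roughly like this (error checking trimmed; `dev` is the CUdevice obtained earlier with cuDeviceGet):

```cuda
// After the kernel has faulted with CUDA_ERROR_ILLEGAL_ADDRESS:
CUresult rc = cuCtxSynchronize();          // reports the sticky error

// Tear down the device's primary context...
rc = cuDevicePrimaryCtxReset(dev);

// ...then try to obtain a fresh one.
CUcontext ctx;
rc = cuDevicePrimaryCtxRetain(&ctx, dev);  // still returns the old error
cuCtxSetCurrent(ctx);
```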

Is it possible to recover from a broken context?

nvidia-smi tells me that I am using driver version: 375.66
OS: Ubuntu
nvcc --version: Cuda compilation tools, release 8.0, V8.0.44

It’s possible, yes. For the CUDA runtime API, cudaDeviceReset() will do the trick.
I would have to research the exact method in the driver API.
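With the runtime API, a minimal recovery sketch would be something like the following (assuming the fault is confined to this process, and that all allocations and streams are re-created afterwards):

```cuda
cudaError_t err = cudaDeviceSynchronize();
if (err == cudaErrorIllegalAddress) {
    // Destroys all allocations and resets the device state
    // for the current process.
    cudaDeviceReset();
    // Re-create streams/allocations and relaunch from a clean state.
}
```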

Also, I would recommend updating your CUDA 8.0.44 install to the latest CUDA 8.0.61, just on general principle. Not connected to your inquiry.

Thank you for the reply txbob.

I have now tried a few more approaches to try to reset to a new context but none have worked so far.

  • I have tried to carefully manage the push/pop of the context and then make sure it is deleted after use or after a failure.
  • I have tried to use the primary context instead of creating a context.
  • I have tried calling cuDevicePrimaryCtxReset in different locations and multiple times.
  • I have tried running the CUDA code in a single thread.
  • I have tried deleting every context I could find.
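The push/pop variant from the first bullet looks roughly like this (checks trimmed; `dev` is the CUdevice from cuDeviceGet):

```cuda
// Create a fresh context; cuCtxCreate also pushes it onto
// the calling thread's context stack.
CUcontext ctx;
cuCtxCreate(&ctx, 0, dev);

// ... launch work; the kernel faults with an illegal memory access ...

// Pop and destroy the broken context.
CUcontext popped;
cuCtxPopCurrent(&popped);
cuCtxDestroy(ctx);

// A subsequent cuCtxCreate on the same device still returns the old error.
```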

Nothing seems to work, and so far I have not been able to find any useful hints in the documentation.

Edit: I tried calling cudaDeviceReset, but it did not work. I will change my code to use only the runtime API instead and then test cudaDeviceReset again.