I have a kernel which fails due to an illegal memory access and I want to recover from this failure.
After the kernel fails, I try to destroy the old context with:
cuDevicePrimaryCtxReset
Then I try to create a new context with either:
cuDevicePrimaryCtxRetain
or
cuCtxCreate
The problem seems to be that the error from the failed kernel launch has not been cleared, and creating a new context fails with that same old error.
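In code, the sequence I am attempting looks roughly like this (a simplified sketch, not my actual code; setup and error handling are trimmed):

```c
#include <cuda.h>
#include <stdio.h>

int main(void)
{
    CUdevice dev;
    CUcontext ctx;

    cuInit(0);
    cuDeviceGet(&dev, 0);

    /* ... a context is created, a kernel is launched and faults
     * with an illegal memory access ... */

    /* Destroy the old (broken) context: */
    CUresult res = cuDevicePrimaryCtxReset(dev);

    /* Then try to get a new context, either: */
    res = cuDevicePrimaryCtxRetain(&ctx, dev);
    /* or: res = cuCtxCreate(&ctx, 0, dev); */

    /* Problem: res still reports CUDA_ERROR_ILLEGAL_ADDRESS from the
     * failed launch instead of CUDA_SUCCESS. */
    printf("res = %d\n", (int)res);
    return 0;
}
```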
Is it possible to recover from a broken context?
nvidia-smi tells me that I am using driver version: 375.66
OS: Ubuntu
nvcc --version: Cuda compilation tools, release 8.0, V8.0.44
It’s possible, yes. For the CUDA runtime API, cudaDeviceReset() will do the trick.
I would have to research the exact method in the driver API.
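For illustration, a minimal runtime-API sketch (the faulting kernel here is just a stand-in):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel that triggers an illegal memory access.
__global__ void bad_kernel(int *p) { *p = 42; }

int main()
{
    bad_kernel<<<1, 1>>>(nullptr);
    cudaError_t err = cudaDeviceSynchronize();
    printf("after fault: %s\n", cudaGetErrorString(err)); // illegal memory access

    // Destroys the broken context (and all of its allocations).
    cudaDeviceReset();

    // The next runtime call lazily creates a fresh context.
    int *d = nullptr;
    err = cudaMalloc(&d, sizeof(int));
    printf("after reset: %s\n", cudaGetErrorString(err)); // expect cudaSuccess
    cudaFree(d);
    return 0;
}
```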
Also, I would recommend updating your CUDA 8.0.44 install to the latest CUDA 8.0.61, just on general principle; it's not connected to your inquiry.
Thank you for the reply, txbob.
I have now tried a few more approaches to reset to a new context, but none have worked so far.
- I have tried to carefully manage the push/pop of the context and then make sure it is deleted after use or after a failure (see the sketch after this list).
- I have tried to use the primary context instead of creating a context.
- I have tried calling cuDevicePrimaryCtxReset in different locations and multiple times.
- I have tried running the CUDA code in a single thread.
- I have tried deleting every context I could find.
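For reference, the push/pop management from the first bullet looks roughly like this (simplified; my real code checks every return value):

```c
#include <cuda.h>

/* Simplified sketch of the context management I tried;
 * the kernel launch itself is elided. */
static void run_and_cleanup(CUdevice dev)
{
    CUcontext ctx;

    cuCtxCreate(&ctx, 0, dev);    /* creates the context and makes it current */
    cuCtxPopCurrent(NULL);        /* detach it from this thread again */

    cuCtxPushCurrent(ctx);        /* push around each use */
    /* ... launch the kernel here; it faults with an illegal access ... */
    cuCtxPopCurrent(NULL);

    cuCtxDestroy(ctx);            /* delete the context after use / failure */
    cuDevicePrimaryCtxReset(dev); /* also tried in various places and orders */
}
```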
Nothing seems to work, and I have so far not been able to find any useful hints in the documentation.
Edit: I tried calling cudaDeviceReset, but it did not work. I will try changing my code to use only the runtime API and then test cudaDeviceReset again.