How to detect "Display driver stopped responding and has recovered."

There is an event message “Display driver stopped responding and has recovered.” I know it is related to WDDM Timeout Detection and Recovery (TDR). My question is how to detect this situation in code. Which API should I call?

I find that even if only one process causes the driver to stop, once the driver has recovered, all processes using CUDA are in an error state and must be restarted.
I want to detect this situation from within a process and restart all the processes in my program. How can I do that?

And when the event happens, can I get process information, such as the process ID, from the message or the driver log?

By the way, I am running Windows 7 Ultimate 64-bit. Thank you.

The CUDA runtime API will return error codes when this happens. It is not guaranteed in all cases to be a particular error code, but one of the error codes that may be returned is fairly explicit:

From driver_types.h:

/**
     * This indicates that the device kernel took too long to execute. This can
     * only occur if timeouts are enabled - see the device property
     * \ref ::cudaDeviceProp::kernelExecTimeoutEnabled "kernelExecTimeoutEnabled"
     * for more information.
     * This leaves the process in an inconsistent state and any further CUDA work
     * will return the same error. To continue using CUDA, the process must be terminated
     * and relaunched.
     */
    cudaErrorLaunchTimeout                =      6,

Other error codes that may be returned include "unspecified launch failure" (cudaErrorLaunchFailure).

You would know that a kernel hit the timeout by inspecting the CUDA status at the next synchronous CUDA API call after the launch (the timeout condition itself occurs asynchronously). In particular, you should see the status cudaErrorLaunchTimeout being returned, as documented here:

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html

So you might code:

cudaError_t err = cudaDeviceSynchronize();
if (err == cudaErrorLaunchTimeout) {
    /* the kernel hit the watchdog timeout; react here [...] */
}

[Forgot to F5 yet again …]

Thanks for your reply.
But what if the kernel timeout happens in another process and that stops the driver?
Can I detect it from my other processes, so that I can restart them
automatically?

Can anyone help me? Thank you.

CUDA in process A knows nothing about CUDA in process B.

Therefore you would need to devise a non-CUDA based method to do what you want.

If the process reports an error condition (because it's doing proper CUDA error checking) and exits as a result, it would be easy to detect that the process is no longer running, and restart it.

But CUDA doesn’t provide any method to do this.
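As a rough illustration of such a non-CUDA method, here is a minimal supervisor sketch in standard C++. It assumes that each worker does its own CUDA error checking and exits with a non-zero status when it sees an error such as cudaErrorLaunchTimeout. The names supervise and SuperviseResult are made up for this example and are not part of CUDA or any library.

```cpp
#include <cassert>
#include <cstdlib>
#include <iostream>
#include <string>

// Outcome of supervising one worker command (hypothetical helper type).
struct SuperviseResult {
    bool clean_exit;  // true if the last run returned status 0
    int restarts;     // how many times the worker was relaunched
};

// Runs `command` and relaunches it whenever it finishes with a non-zero
// status, up to `max_restarts` relaunches. Assumption: the worker exits
// with a non-zero status once it detects a CUDA error.
SuperviseResult supervise(const std::string& command, int max_restarts) {
    assert(max_restarts >= 0);
    int restarts = 0;
    for (;;) {
        // std::system blocks until the command finishes. The encoding of
        // the returned status is platform-specific, so only zero versus
        // non-zero is inspected here.
        int status = std::system(command.c_str());
        if (status == 0) return {true, restarts};
        if (restarts >= max_restarts) return {false, restarts};
        ++restarts;
        std::cerr << "worker failed (status " << status << "), restart "
                  << restarts << " of " << max_restarts << '\n';
    }
}
```

A call such as supervise("my_cuda_worker.exe", 5) would keep relaunching the worker up to five times (my_cuda_worker.exe is a placeholder name). Because std::system blocks, supervising several workers at once would need one thread per worker, or the platform's process APIs instead.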

Thank you. But actually, when the TDR happens,
the process does not exit; it only enters an error state.
That is my problem.

When a TDR happens, the process that owns the context which triggered the TDR can know this from the CUDA status: cudaErrorLaunchTimeout. It can react in any way it deems appropriate.

This is just like any other hardware-related issue a process may encounter, such as out-of-memory, or a disk that is full or inaccessible. If the process determines that it makes no sense to continue in the presence of such a condition, it can terminate itself. That decision would not be made by the disk subsystem, or the memory allocator.

So, in your CUDA programs, check for cudaErrorLaunchTimeout, and terminate when you detect it, if that is what you want your application to do. Since processes are independent of each other, CUDA gives one process no way to terminate an unrelated process. But the OS gives you methods of killing processes (e.g. on Linux, use ps to find the process, then use kill to terminate it; on Windows, tasklist and taskkill serve the same purpose) as long as you as the user have sufficient permission to do so.

OK, thank you.