Hi, all!
I have a kernel that occasionally, depending on the settings the user provides, runs long enough to be killed by the Windows watchdog timer (“The launch timed out and was terminated”). I’ve read, posted, and finally given up on disabling or changing the timeout value, but I still need to be able to recover from this error.
I’ve trapped it with cudaGetLastError, so I know when it happens and the program doesn’t terminate immediately. But once I hit the error, every subsequent CUDA call (including, and especially, cudaMemcpy) returns the same error with no delay at all.
So am I just not clearing the error correctly? (The manual seems to indicate that cudaGetLastError should do the job, and it doesn’t offer any alternatives.) Is a launch timeout so catastrophic that I simply can’t recover from it and continue using CUDA? Is there any way to “reset” my relationship with the device so it will talk to me again?
Gist of my code:
kernel<<<...>>>(...); // times out
cudaEventRecord( estop, 0 );
...
cudaEventSynchronize( estop );
cudaError_t err = cudaGetLastError(); // err = cudaErrorLaunchTimeout
// handle the error and get on with life
...
// attempt a memcpy
CUDA_SAFE_CALL( cudaMemcpy(...) );
// check errors again --> same launch timeout error
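For reference, a self-contained sketch of that sequence might look like the following. The `spin` kernel, its cycle count, and the `d_out`/`h_out` buffers are hypothetical stand-ins for the real kernel and data, chosen only so the launch runs long enough to trip the watchdog on a display GPU. It assumes a device that supports `clock64()`, and it prints whatever status each call returns rather than assuming a particular result.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: busy-waits for the
// requested number of clock cycles, then writes a flag.
__global__ void spin(long long cycles, int *out)
{
    long long start = clock64();
    while (clock64() - start < cycles) { /* burn time */ }
    *out = 1;
}

int main()
{
    int *d_out = 0;
    int h_out = 0;
    cudaEvent_t estop;

    cudaMalloc(&d_out, sizeof(int));
    cudaEventCreate(&estop);

    // 2^34 cycles is on the order of ten seconds at GHz-range clocks
    // (an assumption; adjust for the device at hand).
    spin<<<1, 1>>>(1LL << 34, d_out);
    cudaEventRecord(estop, 0);

    // Wait for the kernel, then report what the runtime says.
    cudaError_t err = cudaEventSynchronize(estop);
    printf("after sync:   %s\n", cudaGetErrorString(err));

    err = cudaGetLastError();
    printf("last error:   %s\n", cudaGetErrorString(err));

    // Attempt a memcpy after the failed launch and report its status.
    err = cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("after memcpy: %s\n", cudaGetErrorString(err));

    return 0;
}
```

Printing the status of each call separately makes it easy to see whether the error is reported once or on every call that follows the launch.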
Any suggestions?
Thanks in advance!
Ben Weiss
Oregon State University Graphics