Recovering from watchdog timeout

Hi, all!

I have a kernel that occasionally, depending on the settings the user provides, gets killed by the Windows watchdog timer (“The launch timed out and was terminated”). I’ve read up, posted, and finally given up on disabling or changing the timeout value, but I still need to be able to recover from this error.

I’ve trapped it with cudaGetLastError, so I know when it happens and the program doesn’t terminate on the spot. But once I hit the error, every subsequent CUDA call (including, and especially, cudaMemcpy) returns the same error immediately, with no delay at all.

So am I just not clearing the error correctly? (The manual seems to indicate that cudaGetLastError should do the job, and it doesn’t offer any alternatives.) Is a launch timeout so catastrophic that I simply can’t recover from it and continue using CUDA? Is there any way to “reset” my relationship with the device so that it will talk to me again?

Gist of my code:

kernel<<<...>>>(...);    // times out

cudaEventRecord( estop, 0 );

...

cudaEventSynchronize( estop );

cudaError_t err = cudaGetLastError();   // err = cudaErrorLaunchTimeout

// handle the error and get on with life

...

// attempt a memcpy

CUDA_SAFE_CALL( cudaMemcpy(...) );

// check errors again --> same launch timeout error

Any suggestions?

Thanks in advance!

Ben Weiss

Oregon State University Graphics

Have you tried to terminate all threads which were touching CUDA and create them again?
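To make the idea concrete, here is a minimal host-side sketch. It assumes the runtime keeps its error state per host thread, so a thread that hit the timeout can be thrown away and replaced. Nothing here calls CUDA: FakeError, fake_launch, fake_memcpy, launch_then_copy and run_in_fresh_thread are all hypothetical stand-ins, and in real code the job would wrap your actual kernel launch and cudaMemcpy.

```cpp
#include <cassert>
#include <functional>
#include <thread>

// Hypothetical stand-in for the runtime's error codes.
enum class FakeError { Success, LaunchTimeout };

// Hypothetical per-thread error state. This only models the behaviour
// described in this thread; it is not the real CUDA runtime.
thread_local FakeError g_sticky = FakeError::Success;

// Stand-in for a kernel launch: a timeout poisons the calling thread.
FakeError fake_launch(bool times_out) {
    if (g_sticky != FakeError::Success) return g_sticky;
    if (times_out) g_sticky = FakeError::LaunchTimeout;
    return g_sticky;
}

// Stand-in for cudaMemcpy: fails at once if the calling thread is poisoned.
FakeError fake_memcpy() { return g_sticky; }

// One unit of work: launch, then copy the results back.
FakeError launch_then_copy(bool times_out) {
    fake_launch(times_out);   // error is reported by the copy below
    return fake_memcpy();
}

// Runs `job` on a brand-new host thread and waits for it to finish.
// When the thread exits, its thread_local state goes away with it.
FakeError run_in_fresh_thread(const std::function<FakeError()>& job) {
    assert(job);              // an empty std::function cannot be called
    FakeError result = FakeError::Success;
    std::thread worker([&] { result = job(); });
    worker.join();
    return result;
}
```

The caller would run each job through run_in_fresh_thread and, on a timeout, simply submit the next job; it starts on a new thread with clean state. Whether the real runtime behaves this way depends on the CUDA version, so treat it as something to try rather than a guaranteed fix.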

Thanks; I’ll try that.

Well, it sort of worked: I can now make CUDA calls from other threads without getting the error. But to get there I moved most of my CUDA code to a worker thread, and in that thread cudaMemcpyToSymbol (which I use to copy into a fixed-size chunk of module-level static constant memory) fails with “Invalid Device Pointer”. That means I can’t write to constant memory from that thread, and my code needs to be able to do that.

Does anyone have any ideas on how to fix this? I need either to restart CUDA without changing threads, or to be able to write to constant memory declared in another thread.

I tried passing both the pointer to the symbol and the symbol’s name as a string, with the same result.

Thanks!