Recovering from watchdog timeout

BenW · July 24, 2008, 12:30am

Hi, all!

I have a kernel that occasionally, depending on the settings the user provides, is killed by the Windows watchdog timer (“The launch timed out and was terminated”). I’ve read and posted about it and have finally given up on disabling or changing the timeout value, but I still need to be able to recover from this error.

I’ve trapped it with cudaGetLastError, so I know when it happens and the program doesn’t instantly terminate. But once I hit the error, all subsequent CUDA calls (cudaMemcpy in particular) immediately return the same error.

So am I just not clearing the error correctly? (The manual seems to indicate that cudaGetLastError should do the job, and it doesn’t offer any alternatives.) Is a launch timeout so catastrophic that I simply can’t recover from it and continue using CUDA? Is there any way to “reset” my relationship with the device so that it will talk to me again?

Gist of my code:

kernel<<<...>>>(...);    // times out
cudaEventRecord( estop, 0 );
...
cudaEventSynchronize( estop );
cudaError_t err = cudaGetLastError();   // err = cudaErrorLaunchTimeout
// handle the error and get on with life
...
// attempt a memcpy
CUDA_SAFE_CALL( cudaMemcpy(...) );
// check errors again --> same launch timeout error

Any suggestions?

Thanks in advance!

Ben Weiss

Oregon State University Graphics
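The check described in the post above can be sketched as follows. This is a minimal illustration, not Ben's actual code: the kernel body, the buffer size, and the helper name `watchdogActive` are hypothetical stand-ins. It assumes a reasonably recent CUDA toolkit, where `cudaDeviceGetAttribute` with `cudaDevAttrKernelExecTimeout` reports whether a run-time limit applies to the device. As I read the current runtime documentation, `cudaErrorLaunchTimeout` is described as leaving the process in an inconsistent state, with later CUDA calls returning the same error, which matches the behavior reported here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel that sometimes runs too long.
__global__ void kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical helper: true if a kernel run-time limit (watchdog) applies.
static bool watchdogActive(int device)
{
    int enabled = 0;
    if (cudaDeviceGetAttribute(&enabled, cudaDevAttrKernelExecTimeout, device) != cudaSuccess)
        return false;
    return enabled != 0;
}

int main()
{
    const int n = 1 << 20;
    float *d = nullptr;

    printf("watchdog active on device 0: %s\n", watchdogActive(0) ? "yes" : "no");

    cudaError_t err = cudaMalloc(&d, n * sizeof(float));
    if (err == cudaSuccess) err = cudaMemset(d, 0, n * sizeof(float));
    if (err != cudaSuccess) {
        fprintf(stderr, "setup failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    kernel<<<(n + 255) / 256, 256>>>(d, n);
    err = cudaGetLastError();              // reports launch-configuration errors
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();     // reports execution errors, including a timeout

    if (err == cudaErrorLaunchTimeout) {
        // Later calls are expected to fail the same way, so stop using CUDA here.
        fprintf(stderr, "kernel hit the watchdog: %s\n", cudaGetErrorString(err));
        return 2;
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaFree(d);
    return 0;
}
```

Checking the return value of the synchronizing call, rather than only calling cudaGetLastError afterwards, makes it clear which call first reported the timeout.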

BarsMonster · July 24, 2008, 8:35am

Have you tried terminating all the threads that were touching CUDA and creating them again?

BenW · July 24, 2008, 4:06pm

Thanks; I’ll try that.

BenW · July 24, 2008, 5:38pm

Well, it sort of worked: CUDA calls made from other threads no longer return the error. But to get there I moved most of my CUDA code to a worker thread, and now cudaMemcpyToSymbol fails in that thread with “Invalid Device Pointer”. (The call copies a fixed-size chunk into module-level static constant memory.) So I can’t write to constant memory from that thread, and my code needs to be able to do that.

Does anyone have any ideas on how to fix this problem? I either need to restart CUDA without changing threads, or be able to write to constant memory declared in another thread.

I tried passing both the pointer to the symbol and the symbol’s name as a string, with the same result.
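For reference, here is a minimal sketch of writing to module-level constant memory from a worker thread. The names `kParams`, `readParams`, and `worker` are hypothetical. It assumes a toolkit in which all host threads of a process share one context per device (CUDA 4.0 and later, as far as I know). The CUDA 2.0 runtime in use when this thread was written gave each host thread its own context, which may be why the call failed for Ben, though that is a guess. The symbol is passed directly; passing its name as a string is not supported in current toolkits.

```cuda
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

// Hypothetical module-level constant block of fixed size.
__constant__ float kParams[4];

// Copies the constant block to global memory so the host can inspect it.
__global__ void readParams(float *out)
{
    int i = threadIdx.x;
    if (i < 4) out[i] = kParams[i];
}

// Runs entirely in the worker thread: fill constant memory, launch, read back.
static void worker(const float *hostParams, float *hostOut, cudaError_t *status)
{
    float *dOut = nullptr;
    cudaError_t err = cudaMalloc(&dOut, 4 * sizeof(float));
    if (err == cudaSuccess)
        err = cudaMemcpyToSymbol(kParams, hostParams, 4 * sizeof(float));
    if (err == cudaSuccess) {
        readParams<<<1, 4>>>(dOut);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(hostOut, dOut, 4 * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dOut);          // safe even if dOut is still null
    *status = err;
}

int main()
{
    const float params[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float out[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    cudaError_t status = cudaSuccess;

    std::thread t(worker, params, out, &status);
    t.join();

    if (status != cudaSuccess) {
        fprintf(stderr, "worker failed: %s\n", cudaGetErrorString(status));
        return 1;
    }
    for (int i = 0; i < 4; ++i)
        printf("out[%d] = %g\n", i, out[i]);
    return 0;
}
```

Keeping the cudaMemcpyToSymbol call, the kernel launch, and the read-back in the same thread avoids any question about which thread's context owns the symbol.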

Thanks!
