Manually kill kernel before TDR

I would like to limit the maximum execution time of a kernel.

Initially I thought of exploiting the TDR mechanism in Windows, but then I discovered that when the TDR kicks in my entire application gets killed (not only the kernel).

I would like my application to continue normally after the kernel “fails”, aware that the GPU execution attempt failed.

I tried something like this (GTX 560 Ti, Windows 7 64-bit, TDR = 4 seconds):

...

mykernel<<<block_count, thread_count>>>(arg0, arg1);

Sleep(2000); //I want my kernel to execute for at most 2 seconds

cudaMemcpy(...); //Copy partial results back to host memory

cudaDeviceReset(); //Kill the kernel?

...

But this doesn’t work. It seems that the application hangs at cudaDeviceReset and is then killed when the TDR kicks in.

Is there any way to achieve what I have in mind?

While this does not (fully) protect against runaway kernel threads, you can place a flag in device memory that kernel threads test either once when they start or periodically. If your GPU supports concurrent copy and execute, you can then set the flag from the host to abort the kernel without producing an error condition, and potentially also save the work that has already been done.
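A minimal sketch of that idea, assuming a device with a copy engine so the host-to-device flag update can overlap the running kernel (kernel name, sizes and iteration counts are illustrative):

#include <cuda_runtime.h>
#include <windows.h>   // Sleep(), to match the snippets above

// Worker kernel that polls a host-settable abort flag between iterations.
__global__ void myKernel(volatile int *abort_flag, float *results, int iterations)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < iterations; ++i) {
        if (*abort_flag)            // flag raised by the host?
            return;                 // exit cleanly, partial results remain valid
        results[tid] += 1.0f;       // placeholder for the real work
    }
}

int main()
{
    const int threads = 256, blocks = 64;

    float *d_results;
    int   *d_flag;
    cudaMalloc((void **)&d_results, threads * blocks * sizeof(float));
    cudaMalloc((void **)&d_flag, sizeof(int));
    cudaMemset(d_results, 0, threads * blocks * sizeof(float));
    cudaMemset(d_flag, 0, sizeof(int));

    // Pinned host copy of the flag so the async update can overlap the kernel.
    int *h_flag;
    cudaMallocHost((void **)&h_flag, sizeof(int));
    *h_flag = 1;

    cudaStream_t compute, control;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&control);

    myKernel<<<blocks, threads, 0, compute>>>(d_flag, d_results, 1 << 24);

    Sleep(2000);   // the kernel's time budget

    // Raise the abort flag from the host on the second stream.
    cudaMemcpyAsync(d_flag, h_flag, sizeof(int), cudaMemcpyHostToDevice, control);
    cudaStreamSynchronize(compute);   // kernel should exit shortly after seeing the flag

    // ...copy partial results back, clean up...
    cudaFree(d_results);
    cudaFree(d_flag);
    cudaFreeHost(h_flag);
    return 0;
}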

If you want to make this solution a bit more robust against runaway threads, a thread that has detected that the flag is set can execute asm volatile("trap;"); to abort all other threads as well.
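For instance, the polling check from the sketch above could trap instead of returning:

__global__ void myKernel(volatile int *abort_flag, float *results, int iterations)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < iterations; ++i) {
        // Instead of returning, trap: this tears down the whole launch, but
        // leaves the context in an error state that must be cleared afterwards.
        if (*abort_flag)
            asm volatile("trap;");
        results[tid] += 1.0f;
    }
}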

Another strategy to protect against runaway threads would be to have each thread check clock64() regularly and abort if the difference to the thread's starting time gets too large. Keep in mind though that clock() and clock64() are not guaranteed to be consistent between different threads.
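A sketch of that variant; the tick budget (max_ticks, an assumed parameter) has to be derived from the device's clock rate (cudaDeviceProp::clockRate, in kHz), and each thread compares only against its own start time:

__global__ void myKernel(float *results, int iterations, long long max_ticks)
{
    long long start = clock64();            // per-thread start time
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < iterations; ++i) {
        results[tid] += 1.0f;               // placeholder for the real work
        if (clock64() - start > max_ticks)  // per-thread budget exceeded
            return;                         // or asm volatile("trap;");
    }
}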

Thank you. Actually I was looking for a “purely CPU-based” approach without any modification to the kernel, but your advice is still very useful.

In your original approach, what are the arguments to the cudaMemcpy() call - do you put it on a different stream than the kernel? Otherwise it will have to wait for the kernel to finish, which it (by definition) doesn’t in the runaway case.

It was in the same stream, yes.

So you are saying that if I perform the memcpy in another stream it could work?

Yes. It definitely can’t work if they are in the same stream.

It still doesn’t work.

Here is what I did:

cudaStream_t primary;

cudaStream_t secondary;

cudaStreamCreate( &primary );

cudaStreamCreate( &secondary );

myKernel<<<num_blocks, num_threads, 0, primary>>>( d_input_data, d_results );

Sleep(2000);

cudaMemcpyAsync( h_results, d_results, num_threads * num_blocks * sizeof(*d_results), cudaMemcpyDeviceToHost, secondary );

cudaStreamSynchronize( secondary );

cudaDeviceReset();

Using Visual Studio 2010 I noticed that the application hangs at cudaMemcpyAsync (even though it is supposed to be async).

Then the TDR kicks in and the application continues without being killed, but every subsequent call to any CUDA function returns cudaErrorDevicesUnavailable.

I tried resetting the device again, or doing cudaSetDevice(0); cudaDeviceReset();, with no result.

The only way to fix it is to restart the application.

I also tried using mapped memory and while I got the partial results, the GPU becomes useless after the TDR timeout.
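For reference, a sketch of that mapped-memory variant, reusing the names from the snippet above (cudaSetDeviceFlags must be called before the context is created):

// Before any CUDA call that creates the context:
cudaSetDeviceFlags(cudaDeviceMapHost);

// Allocate the result buffer as mapped, page-locked host memory.
float *h_results, *d_results_view;
cudaHostAlloc((void **)&h_results, num_blocks * num_threads * sizeof(float),
              cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&d_results_view, h_results, 0);

// The kernel writes through the device alias; the data lands directly in
// host memory, so partial results survive even if the kernel is killed.
myKernel<<<num_blocks, num_threads>>>(d_input_data, d_results_view);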

I don’t think there is a way to do it from the host only. Also, executing asm trap will force you to call cudaDeviceReset, since all subsequent CUDA API calls will fail until you do.

The “asm trap” problem is not really an issue because I always run one kernel at a time.

Now that you mention it, will the trap; instruction kill just the calling kernel, or every kernel currently running on the device?

EDIT: I made a “killer kernel” which simply executes a trap instruction. When I launch this kernel concurrently, all the kernels currently executing get killed. However, after the trap instruction CUDA doesn’t work anymore, even after a cudaDeviceReset. Any ideas?
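The killer kernel itself is essentially just a trap, launched concurrently on its own stream while the worker kernel is running (hypothetical usage shown with the secondary stream from the earlier snippet):

// Traps as soon as it runs, aborting every kernel currently executing on
// the device and leaving the context in an error state.
__global__ void killerKernel()
{
    asm volatile("trap;");
}

// Host side: fire it on the secondary stream.
killerKernel<<<1, 1, 0, secondary>>>();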

EDIT: Actually the device still works after the cudaDeviceReset(); I just had to call cudaGetLastError() to reset the error flag. That fixes part of my problem (killing the runaway kernel), but after the cudaDeviceReset my partial results are gone. Any idea how to circumvent this?