I would like to limit the maximum execution time of a kernel.
Initially I thought of exploiting the TDR mechanism in Windows, but then I discovered that when the TDR kicks in my entire application gets killed (not only the kernel).
I would like my application to continue normally after the kernel "fails", while being aware that the GPU execution attempt failed.
I tried something like this (GTX 560Ti, Windows 7 64 bit, TDR=4seconds)
mykernel<<<block_count, thread_count>>>(arg0, arg1);
Sleep(2000); //I want my kernel to execute for at most 2 seconds
cudaMemcpy(...); //Copy partial results back to host memory
cudaDeviceReset(); //Kill the kernel?
But this doesn't work. It seems the application hangs at cudaDeviceReset() and then gets killed by the TDR.
While this does not protect (fully) against runaway kernel threads, you can place a flag in device memory that kernel threads test either when beginning to run or even periodically. If your GPU supports concurrent copy and execute, you can then set it from the host to abort the kernel without producing an error condition, and potentially also saving the work that has already been done.
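A minimal sketch of this flag-based approach (all names are hypothetical). It assumes a device that supports mapped pinned (zero-copy) memory, so the host can flip the flag while the kernel is running; on a Windows host, `Sleep()` would take the place of `usleep()`:

```cuda
#include <cuda_runtime.h>
#include <unistd.h>   // usleep; use Sleep() from <windows.h> on Windows

__global__ void worker(volatile int *abort_flag, float *results, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int iter = 0; iter < 1000000; ++iter) {
        if (*abort_flag)            // checked periodically
            return;                 // exit cleanly; partial results stay valid
        if (i < n)
            results[i] += 1.0f;     // placeholder for the real work
    }
}

int main()
{
    int *abort_flag, *d_abort_flag;
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc(&abort_flag, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_abort_flag, abort_flag, 0);
    *abort_flag = 0;

    float *d_results;
    cudaMalloc(&d_results, 1024 * sizeof(float));
    cudaMemset(d_results, 0, 1024 * sizeof(float));

    worker<<<4, 256>>>((volatile int *)d_abort_flag, d_results, 1024);

    usleep(2000 * 1000);            // let the kernel run for ~2 seconds
    *abort_flag = 1;                // signal the abort from the host
    cudaDeviceSynchronize();        // kernel returns normally, no TDR fires

    // Partial results in d_results can now be copied back with cudaMemcpy.
    cudaFree(d_results);
    cudaFreeHost(abort_flag);
    return 0;
}
```

Because the kernel returns rather than trapping, no error condition is raised and the partial results remain in device memory for a normal cudaMemcpy afterwards.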
If you want to make this solution a bit more robust against runaway threads, a thread that has detected the flag to be set can execute `asm volatile("trap;");` to abort all other threads as well.
Another strategy to protect against runaway threads would be to have each thread check `clock64()` regularly and abort once the elapsed time since the thread started exceeds a budget. Keep in mind though that `clock()` and `clock64()` are not guaranteed to be consistent between different threads.
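The `clock64()` variant might look like the sketch below (names hypothetical). The host converts a wall-clock budget into clock cycles using the device's `clockRate` property; note `clock64()` counts per-SM cycles, so this is an approximation:

```cuda
#include <cuda_runtime.h>

__global__ void worker(long long max_clocks, float *results, int n)
{
    long long start = clock64();    // per-thread start time, per-SM counter
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    for (int iter = 0; iter < 1000000; ++iter) {
        if (i < n)
            results[i] += 1.0f;     // placeholder for one chunk of work
        if (clock64() - start > max_clocks)
            return;                 // time budget exceeded: bail out
    }
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // clockRate is in kHz; budget of 2 seconds expressed in clock cycles
    long long max_clocks = 2LL * prop.clockRate * 1000LL;

    float *d_results;
    cudaMalloc(&d_results, 1024 * sizeof(float));
    cudaMemset(d_results, 0, 1024 * sizeof(float));

    worker<<<4, 256>>>(max_clocks, d_results, 1024);
    cudaDeviceSynchronize();        // returns within ~2 s even if work remains

    cudaFree(d_results);
    return 0;
}
```

Unlike the host-set flag, this needs no host intervention, but each thread only polices itself, so the check must sit inside every potentially long-running loop.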
In your original approach, what are the arguments to the cudaMemcpy() call - do you put it on a different stream than the kernel? Otherwise it will have to wait for the kernel to finish, which it (by definition) doesn't in the runaway case.
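For the copy to overlap with the running kernel it would have to be asynchronous, in a separate stream, and from/to pinned host memory - something along these lines (buffer names hypothetical):

```cuda
#include <cuda_runtime.h>

// Assumed to exist elsewhere: the kernel and its device/host buffers.
__global__ void mykernel(float *arg0, int arg1) { /* ... */ }

void launch_with_overlapped_copy(float *d_partial, float *h_pinned,
                                 size_t bytes, float *arg0, int arg1)
{
    cudaStream_t compute_stream, copy_stream;
    cudaStreamCreate(&compute_stream);
    cudaStreamCreate(&copy_stream);

    mykernel<<<64, 256, 0, compute_stream>>>(arg0, arg1);

    // Runs concurrently with the kernel only because it is async, in a
    // different stream, and h_pinned was allocated with cudaHostAlloc.
    cudaMemcpyAsync(h_pinned, d_partial, bytes,
                    cudaMemcpyDeviceToHost, copy_stream);
    cudaStreamSynchronize(copy_stream);   // waits for the copy, not the kernel

    cudaStreamDestroy(compute_stream);
    cudaStreamDestroy(copy_stream);
}
```

A plain synchronous cudaMemcpy() goes into the default stream and is ordered after the kernel, which is why the original snippet blocks until the TDR kills everything.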
The "asm trap" problem is not really an issue because I always run one kernel at a time.
Now that you mention it, does the "trap;" instruction kill just the calling kernel, or every kernel currently running on the device?
EDIT: I made a "killer kernel" which simply executes a trap instruction. When I launch this kernel concurrently, all the kernels currently executing get killed. However, after the trap instruction CUDA doesn't work anymore, even after a cudaDeviceReset(). Any ideas?
EDIT: Actually the device still works after the cudaDeviceReset(). It is just that I had to call cudaGetLastError() to reset the error flag… Now I have fixed part of my problem (killing the runaway kernel), but after the cudaDeviceReset() my partial results are gone. Any idea how to circumvent this?