This has come up several times, but am I right in thinking there is no way to specify
a timeout for a kernel? E.g. I expect my kernel to take 200 ms, so it would be nice to
ask CUDA to abort it (with a suitable status code) if it is still running 2 seconds
after it was launched. This could be an optional parameter supplied at launch time.
Previous suggested solutions involved
- not writing buggy code, or
- adding explicit device code to the inside of every loop in the kernel
to check for either a loop-counter overrun or ticks (clock64()) exceeding a threshold.
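The second of those suggestions can be sketched roughly as follows. The kernel name, the loop, and the tick limit are all illustrative; note that clock64() counts SM clock ticks, so a wall-clock budget has to be scaled by the GPU's clock rate (queryable on the host via cudaDeviceGetAttribute with cudaDevAttrClockRate).

```cuda
// Sketch: per-thread timeout check inside the kernel's main loop.
__global__ void work_kernel(int *status, long long tick_limit)
{
    long long start = clock64();          // ticks when this block started
    for (int i = 0; i < 1000000; ++i) {
        // ... per-iteration work ...
        if (clock64() - start > tick_limit) {
            *status = 1;                  // flag the overrun for the host
            return;                       // abandon this thread's work
        }
    }
}
```

The check itself is cheap, but as noted it has to be hand-inserted into every loop, which is exactly the burden the question is trying to avoid.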
Comments and suggestions welcome
Kernel execution time is limited by an OS-level watchdog timer that limits how long the GUI is allowed to stay frozen (for example, because the GPU is busy with a CUDA kernel). Since it is an OS-level policy, I am afraid you will have to dial in the desired time limit using OS-specific means.
txbob has, on multiple occasions, posted a link to the relevant information regarding WDDM watchdog timer configuration on Microsoft’s website, but of course now that I am looking for his recommendation I cannot find it. I think it is this page:
This video has specific instructions of how to change the timeout in Windows:
And I believe some flavors of Linux have a timeout as well, but I cannot remember the details of how to adjust it.
Since your time limit may be small, using the OS-level timeout is probably not the preferred method. Maybe you could have a 64-bit global device unsigned integer value which somehow keeps track of the passage of time, then set a sync point in your thread blocks to evaluate it and perform an atomic update. That will slow your running time, though, so I am not sure it is worth it.
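One possible reading of that suggestion, sketched as a device-wide abort flag: the first thread of each block compares clock64() against a limit and atomically publishes an overrun, and every block polls the flag at a sync point. The names, loop, and polling interval are all illustrative, not a definitive implementation.

```cuda
// Sketch: global abort flag evaluated at a per-block sync point.
__device__ unsigned int g_abort = 0;

__global__ void guarded_kernel(long long tick_limit)
{
    __shared__ unsigned int abort_now;    // one copy per block
    long long start = clock64();
    for (int iter = 0; iter < 100000; ++iter) {
        // ... work ...
        if (threadIdx.x == 0) {
            if (clock64() - start > tick_limit)
                atomicExch(&g_abort, 1u);          // publish the overrun
            abort_now = atomicAdd(&g_abort, 0u);   // one atomic read per block
        }
        __syncthreads();                            // the sync point mentioned above
        if (abort_now)
            return;                                 // uniform exit for the whole block
    }
}
```

Routing the flag through a shared variable keeps the exit uniform within a block, so the __syncthreads() cannot deadlock on divergent threads; the cost is one atomic and one barrier per iteration of the polled loop.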
Other than that, you could break the kernel into smaller pieces and have the host keep track of time.
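That host-side approach might look like the following sketch, where the work is split into slices launched one at a time and the host checks its own clock between launches. The kernel, sizes, and 2-second budget are illustrative.

```cuda
// Sketch: chunked launches with the host enforcing a wall-clock budget.
#include <chrono>
#include <cstdio>

__global__ void work_chunk(float *data, int offset, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[offset + i] *= 2.0f;   // placeholder work on one slice
}

int main()
{
    const int total = 1 << 24, chunk = 1 << 20;
    float *data;
    cudaMalloc(&data, total * sizeof(float));
    auto t0 = std::chrono::steady_clock::now();
    for (int off = 0; off < total; off += chunk) {
        work_chunk<<<(chunk + 255) / 256, 256>>>(data, off, chunk);
        cudaDeviceSynchronize();    // wait so the host can consult the clock
        double ms = std::chrono::duration<double, std::milli>(
                        std::chrono::steady_clock::now() - t0).count();
        if (ms > 2000.0) {          // the 2-second budget from the question
            fprintf(stderr, "timeout after %.0f ms, abandoning work\n", ms);
            break;
        }
    }
    cudaFree(data);
    return 0;
}
```

The granularity of the timeout is the duration of one chunk, and each synchronize adds launch overhead, so there is a tuning trade-off between responsiveness and throughput.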
In general an interesting question; let us know what you find as a solution.
As far as I am aware, all operating systems supported by CUDA have a watchdog timer that ensures the GUI doesn’t freeze for an indefinite period of time. The watchdog applies only to those GPUs that are driving a display. The best solution to avoid problems with a GUI watchdog timer is to run without GUI on the GPU in question, which is easily done in Linux by not running X.
I think it is debatable whether purposeful manipulation of a GUI watchdog timer for use as an emergency brake for runaway CUDA kernels is a particularly useful concept. There is no watchdog timer protecting programmers from runaway CPU code either. Admittedly it is usually easier to stop a runaway CPU program, but I have also encountered enough cases where my erroneous code pegged the machine to such a degree that I had trouble regaining control either interactively or by remote access.
Dear CudaaduC and njuffa,
Thank you for your kind replies and suggestions.
It seems I was right: CUDA does not currently support a per-kernel timeout.
I think such a facility (which might even be enabled by default) would be generally
useful. CUDA development still remains far too difficult to reach a mass market.
I prefer not to do it at the host operating system level. I am using a Tesla for GPGPU
code development, so there is no screen connected and X11 is not used.
BTW Linux has a limit cputime command, which I am using as an ultimate backstop,
but it kills my whole application rather than the runaway kernel.
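For anyone using bash rather than csh, the equivalent backstop is the ulimit built-in; the application name below is of course hypothetical, and as noted this kills the entire process, not just the offending kernel.

```shell
# Limit the process's total CPU time to 5 seconds (bash built-in;
# "limit cputime 5" is the csh equivalent mentioned above).
ulimit -t 5
./my_cuda_app    # hypothetical application; killed with SIGKILL on overrun
```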
One of the irritating problems when using (possibly nested) device functions in the kernel,
and coding a solution in the kernel itself, is that CUDA does not support register variables
(e.g. containing the time the block started) outside the scope of the functions but shared between
them. Such a common variable is easy to arrange in CPU code, but in CUDA it does not seem possible
to keep it in a register, so one must either pass it explicitly via the functions’ arguments
or accept the overhead of using a non-register variable.
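The explicit-argument workaround described above can be sketched like this, with the per-block start time read once in the kernel and threaded through the nested device functions as a parameter so the compiler can keep it in a register. All function names here are illustrative.

```cuda
// Sketch: threading the block's start time through device functions
// as an argument, so it never has to live in shared or global memory.
__device__ bool timed_out(long long start, long long tick_limit)
{
    return clock64() - start > tick_limit;
}

__device__ void inner_work(long long start, long long tick_limit)
{
    // ... work ...
    if (timed_out(start, tick_limit))
        return;                        // bail out of this nesting level
    // ... more work ...
}

__global__ void guarded(long long tick_limit)
{
    long long start = clock64();       // read once; lives in a register
    inner_work(start, tick_limit);     // passed explicitly, as described above
}
```

The cost is the extra parameter on every (possibly nested) call, which is exactly the clutter complained about above, but device functions are usually inlined, so the runtime overhead is typically negligible.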