Per-kernel timeout

I guess this has been asked before, but hopefully not recently:
How are we doing with getting NVIDIA to implement a per-kernel timeout?
I.e., I want to be able to say “this kernel should not take more than one millisecond;
if it is still running after one millisecond, the GPU must abort it”.
The point being, if (when :-( ) I create an infinite loop in one of my kernels,
I would like a software way of stopping it. Power cycling the GPU is often not convenient.

NB: I want the CUDA kernel to stop.
This is not the same as having my X Window System session abort!!

Many thanks

Bill

If you simply want to cover the case where you have an unintended infinite loop, you can use cudaDeviceReset() to start over in a programmatic way.

So:

1. Launch the kernel.
2. Record an event after the kernel.
3. Start a host-based timing method (e.g. record a start time).
4. Go into a loop that monitors the event via cudaEventQuery() as well as the elapsed time.

If the elapsed time in the polling loop exceeds your timeout value, then issue a cudaDeviceReset() and do whatever recovery you wish at that point.
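
In code, that recipe might look something like this (a minimal sketch; the kernel name, the launch configuration and the 1 ms budget are just placeholders I made up):

```cpp
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void possibly_infinite_kernel()
{
    // Stand-in for a buggy kernel: clock64() never goes negative in practice,
    // so this spins forever, but the compiler cannot prove it and remove the loop.
    while (clock64() >= 0) { }
}

int main()
{
    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

    possibly_infinite_kernel<<<1, 1>>>();
    cudaEventRecord(done);                 // fires once the kernel has finished

    const double timeout_ms = 1.0;         // per-kernel budget (placeholder value)
    auto start = std::chrono::steady_clock::now();

    // Poll the event and the elapsed host time.
    while (cudaEventQuery(done) == cudaErrorNotReady) {
        double elapsed_ms = std::chrono::duration<double, std::milli>(
            std::chrono::steady_clock::now() - start).count();
        if (elapsed_ms > timeout_ms) {
            fprintf(stderr, "kernel exceeded %.1f ms, resetting device\n", timeout_ms);
            cudaDeviceReset();             // tears down this context, aborting the kernel
            return 1;                      // or attempt application-level recovery here
        }
    }

    cudaEventDestroy(done);
    return 0;
}
```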

You can even omit the cudaDeviceReset() if you simply intend to inform the user of a fault and exit.

The CUDA runtime will detect the termination of the host process associated with the kernel and “clean up” the device.

Dear Bob,
Thanks for the suggestions.
I guess I am used to the idea that cudaDeviceReset() behaviour is GPU specific,
but on any GPU I am likely to use these days (or any future GPU)
cudaDeviceReset() should be fine.
I recall that with old GPUs (or maybe old versions of CUDA) cudaDeviceReset()
was not always foolproof.

I guess I do not need the host to busy-poll, but could instead wait on a suitable host signal or timer, something like the sketch below?
(Most of my stuff runs on time-shared servers,
so, if possible, I do not want to upset other users.)
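
Just to make the question concrete, here is the sort of thing I have in mind (my own rough sketch; the helper name and the 100 microsecond sleep interval are made up):

```cpp
#include <chrono>
#include <thread>
#include <cuda_runtime.h>

// Low-impact watchdog: sleep between cudaEventQuery() calls so the monitoring
// thread uses almost no CPU on a shared server.
// Returns true if the event fired within timeout_ms, false if we gave up.
bool wait_for_event(cudaEvent_t done, double timeout_ms)
{
    auto start = std::chrono::steady_clock::now();
    while (cudaEventQuery(done) == cudaErrorNotReady) {
        double elapsed_ms = std::chrono::duration<double, std::milli>(
            std::chrono::steady_clock::now() - start).count();
        if (elapsed_ms > timeout_ms)
            return false;                  // caller decides whether to cudaDeviceReset()
        std::this_thread::sleep_for(std::chrono::microseconds(100));
    }
    return true;
}
```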

I was not sure about your last comment.
This is an alternative, right?
I can use cudaDeviceReset() without the need to kill my host-side process?
Did you mean that if my host-side process dies, the CUDA runtime will
invoke cudaDeviceReset() (or something equivalent) as it “cleans up” the GPU?

Thanks again
Bill

cudaDeviceReset() (like all CUDA runtime API calls) should only affect your context; it should have no effect on other users of the same GPU (if several users share a single GPU).

This is not the same as using nvidia-smi to issue a reset to the device. Perhaps that is what you are thinking of.

Yes, as an alternative: if you don’t want to issue a cudaDeviceReset() but do intend to exit the application (thus terminating the host process), the CUDA runtime will do an operation like cudaDeviceReset() on your GPU context anyway.

Thanks Bob :-)