Per-kernel timeout

I guess this has been asked before, but hopefully not recently:
How are we doing with getting NVIDIA to implement a per-kernel timeout?
I.e., I want to be able to say “this kernel should not take more than one millisecond;
if it is still running after one millisecond, the GPU must abort it”.
The point being, if (when :-( ) I create an infinite loop in one of my kernels,
I would like a software way of stopping it. Power cycling the GPU is often not convenient.

NB: I want the CUDA kernel to stop.
This is not the same as having my X Window System session abort!!

Many thanks

Bill

If you simply want to cover the case where you have an unintended infinite loop, you can use cudaDeviceReset() to start over in a programmatic way.

So:

1. Launch the kernel.
2. Record an event after the kernel.
3. Start a host-based timing method (e.g. record a start time).
4. Go into a loop that monitors the event via cudaEventQuery() as well as the elapsed time.

If the elapsed time in the polling loop exceeds your timeout value, then issue a cudaDeviceReset() and do whatever recovery you wish at that point.
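
In code, that recipe might look something like this (a minimal sketch; the kernel name, the launch configuration and the 1 ms budget are just placeholders I made up):

```cpp
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void possibly_infinite_kernel()
{
    // Stand-in for a buggy kernel: clock64() never goes negative in practice,
    // so this spins forever, but the compiler cannot prove it and remove the loop.
    while (clock64() >= 0) { }
}

int main()
{
    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

    possibly_infinite_kernel<<<1, 1>>>();
    cudaEventRecord(done);                 // fires once the kernel has finished

    const double timeout_ms = 1.0;         // per-kernel budget (placeholder value)
    auto start = std::chrono::steady_clock::now();

    // Poll the event and the elapsed host time.
    while (cudaEventQuery(done) == cudaErrorNotReady) {
        double elapsed_ms = std::chrono::duration<double, std::milli>(
            std::chrono::steady_clock::now() - start).count();
        if (elapsed_ms > timeout_ms) {
            fprintf(stderr, "kernel exceeded %.1f ms, resetting device\n", timeout_ms);
            cudaDeviceReset();             // tears down this context, aborting the kernel
            return 1;                      // or attempt application-level recovery here
        }
    }

    cudaEventDestroy(done);
    return 0;
}
```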

You can even omit the cudaDeviceReset() if you simply intend to inform the user of a fault and exit.

The CUDA runtime will detect the termination of the host process associated with the kernel and “clean up” the device.

Dear Bob,
Thanks for the suggestions.
I guess I am used to the idea that cudaDeviceReset() behaviour is GPU specific,
but on any GPU I am likely to use these days (or any future GPU)
cudaDeviceReset() should be fine.
I recall that with old GPUs (or maybe old versions of CUDA) cudaDeviceReset()
was not always foolproof.

I guess I do not need the host to busy-poll, but could instead wait on a suitable host signal or timer, something like the sketch below?
(Most of my stuff runs on time-shared servers,
so, if possible, I do not want to upset other users.)
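
Just to make the question concrete, here is the sort of thing I have in mind (my own rough sketch; the helper name and the 100 microsecond sleep interval are made up):

```cpp
#include <chrono>
#include <thread>
#include <cuda_runtime.h>

// Low-impact watchdog: sleep between cudaEventQuery() calls so the monitoring
// thread uses almost no CPU on a shared server.
// Returns true if the event fired within timeout_ms, false if we gave up.
bool wait_for_event(cudaEvent_t done, double timeout_ms)
{
    auto start = std::chrono::steady_clock::now();
    while (cudaEventQuery(done) == cudaErrorNotReady) {
        double elapsed_ms = std::chrono::duration<double, std::milli>(
            std::chrono::steady_clock::now() - start).count();
        if (elapsed_ms > timeout_ms)
            return false;                  // caller decides whether to cudaDeviceReset()
        std::this_thread::sleep_for(std::chrono::microseconds(100));
    }
    return true;
}
```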

I was not sure about your last comment.
This is an alternative, right?
I can use cudaDeviceReset() without the need to kill my host-side process?
Did you mean that if my host-side process dies, the CUDA runtime will
invoke cudaDeviceReset() (or something equivalent) as it “cleans up” the GPU?

Thanks again
Bill

cudaDeviceReset() (like all CUDA runtime API calls) should only affect your context; it should have no effect on other users of the same GPU (if several users share a single GPU).

This is not the same as using nvidia-smi to issue a reset to the device. Perhaps that is what you are thinking of.

Yes, as an alternative: if you don’t want to issue a cudaDeviceReset() but do intend to exit the application (thus terminating the host process), the CUDA runtime will do an operation like cudaDeviceReset() on your GPU context anyway.

Thanks Bob :-)