Tesla K40 GPU kernel timeout functionality absent

Hi

I am using a Tesla K40 GPU. I would like to be able to kill a GPU kernel without having to kill the process that launched it.

It seems that there are only two options to do that:

  1. Kill the process calling it.
  2. Introduce a check for execution in the kernel itself.

However, these options are undesirable for us. Are there any other ways that I might have missed, say in hardware or through the CUDA API? If not, are there any plans to handle this in the future?

Thanks

cudaDeviceReset() issued from the owning process (i.e. from the owning application) should kill any and all kernels and eliminate any allocations. It’s not required to kill the process, but this obviously requires programmatic control within the process. It doesn’t require any specifics within the kernel code.
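As a rough sketch of what that could look like: the host launches a kernel, polls it with cudaStreamQuery(), and calls cudaDeviceReset() once a deadline passes. The kernel name spinForever and the 2-second timeout are made up for illustration, and I'm assuming the reset behaves as described above; treat this as something to experiment with rather than a tested recipe.

```cuda
#include <chrono>
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

// Hypothetical runaway kernel: the volatile read keeps the loop from being optimized away.
__global__ void spinForever()
{
    volatile int keepGoing = 1;
    while (keepGoing) { }
}

int main()
{
    spinForever<<<1, 1>>>();                       // launch is asynchronous
    if (cudaGetLastError() != cudaSuccess) return 1;

    // Assumed timeout of 2 seconds, chosen only for this example.
    const auto deadline = std::chrono::steady_clock::now() + std::chrono::seconds(2);

    // cudaStreamQuery returns cudaErrorNotReady while work is still pending.
    while (cudaStreamQuery(0) == cudaErrorNotReady) {
        if (std::chrono::steady_clock::now() > deadline) {
            printf("timeout reached, resetting device\n");
            cudaDeviceReset();                     // tears down the context, kernels and allocations
            break;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }

    // The process is still alive; the next runtime call creates a fresh context.
    int *p = nullptr;
    cudaError_t err = cudaMalloc((void **)&p, sizeof(int));
    printf("cudaMalloc after reset: %s\n", cudaGetErrorString(err));
    if (err == cudaSuccess) cudaFree(p);
    return 0;
}
```

Note that every device pointer, stream and event created before the reset is invalid afterwards, so the application has to re-create whatever state it still needs.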

You could also try experimenting with the

nvidia-smi -r

command. Before trying to use it, please read the --help or man nvidia-smi. Other switches are needed with it and there are various other limitations/restrictions, such as requiring root privilege.

Using any of these approaches as a regular GPU management tool is probably not advisable. Killing the GPU activity underlying a process is a rather large crowbar to the process, and it’s doubtful every side effect or corner case can be imagined or tested.

“However, these options are undesirable for us.”

Why?
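For reference, the in-kernel check (option 2) can be fairly small. Below is a minimal sketch, assuming a stop flag in mapped pinned host memory that the kernel polls; the names longRunning, h_stop and d_iters are invented for the example, and the empty loop body stands in for real work.

```cuda
#include <chrono>
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread keeps working until the host sets *stop.
__global__ void longRunning(volatile int *stop, unsigned long long *iters)
{
    unsigned long long n = 0;
    while (*stop == 0) {      // volatile, so the flag is re-read on every pass
        ++n;                  // placeholder for one chunk of real work
    }
    if (blockIdx.x == 0 && threadIdx.x == 0) *iters = n;
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);         // allow mapped host allocations

    // Stop flag lives in pinned host memory that the GPU can read directly.
    int *h_stop = nullptr;
    cudaHostAlloc((void **)&h_stop, sizeof(int), cudaHostAllocMapped);
    *h_stop = 0;
    int *d_stop = nullptr;
    cudaHostGetDevicePointer((void **)&d_stop, h_stop, 0);

    unsigned long long *d_iters = nullptr;
    cudaMalloc((void **)&d_iters, sizeof(unsigned long long));

    longRunning<<<1, 32>>>(d_stop, d_iters);       // asynchronous launch

    // Let the kernel run for a while, then request a stop from the host.
    std::this_thread::sleep_for(std::chrono::milliseconds(500));
    *(volatile int *)h_stop = 1;

    cudaError_t err = cudaDeviceSynchronize();     // kernel exits on its own
    unsigned long long iters = 0;
    cudaMemcpy(&iters, d_iters, sizeof(iters), cudaMemcpyDeviceToHost);
    printf("sync: %s, loop passes before stop: %llu\n", cudaGetErrorString(err), iters);

    cudaFree(d_iters);
    cudaFreeHost(h_stop);
    return 0;
}
```

The context and all allocations survive the stop, which is the main difference from a device reset. The cost is one extra read of host memory per check, so in real code the flag would probably be polled once per chunk of work rather than in the innermost loop.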