How to cleanly kill a CUDA application

venetis · September 21, 2016, 10:36am

Hello,

I have an application that runs on a GPU for hours or even days, depending on the input. The input is given to me by other people, and sometimes it contains erroneous values due to carelessness. When that happens, I need to kill the application and start it again with the correct input. During development, I also need to kill the application because of errors in the kernel (infinite loops, etc.).

Hitting Ctrl-C or sending a kill signal to the application does not seem to work. Typically, the server with the GPU gets stuck. Sometimes it doesn’t get stuck, but the application becomes a zombie and the kernel continues to run on the GPU. Rebooting the machine from the command line (either before or after hitting Ctrl-C) also doesn’t work every time. In fact, it doesn’t work most of the time, as the machine gets stuck again. Since I work remotely on that machine, this is a big issue for me, as I have to ask someone to physically reboot the machine. And if it is the afternoon, I have to wait until the next morning…

So, the question is how to handle this. I have looked around, and some suggest using a signal handler that calls cudaDeviceReset() and then ends the application by calling exit(). However, there doesn’t seem to be a conclusive answer in any thread on any forum about whether this works reliably.
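To make the idea concrete, here is a minimal host-side sketch of a common variant of that approach. It is an assumption-laden illustration, not a tested fix for the hang described here: the handler only records the request in a flag, and the host loop checks the flag between chunks of work and runs the cleanup itself, which keeps driver calls out of the signal handler. All names (`g_stop_requested`, `install_stop_handlers`, `run_until_stopped`) are made up for this example, and the GPU work is passed in as a callback so the sketch compiles without the CUDA toolkit.

```cpp
#include <csignal>
#include <functional>

// Set by the handler, polled by the host loop. volatile sig_atomic_t is the
// type that may be written safely from inside a signal handler.
volatile std::sig_atomic_t g_stop_requested = 0;

// The handler does nothing except record that a stop was requested.
extern "C" void on_stop_signal(int) {
    g_stop_requested = 1;
}

// Route SIGINT (Ctrl-C) and SIGTERM (plain kill) to the handler.
// SIGKILL (kill -9) cannot be caught, so it never reaches this code.
void install_stop_handlers() {
    std::signal(SIGINT, on_stop_signal);
    std::signal(SIGTERM, on_stop_signal);
}

// Calls run_chunk() until it returns false (no work left) or a stop signal
// has been seen, then always calls cleanup(). Returns the number of chunks run.
int run_until_stopped(const std::function<bool()>& run_chunk,
                      const std::function<void()>& cleanup) {
    int chunks_run = 0;
    while (!g_stop_requested) {
        const bool more_work = run_chunk();
        ++chunks_run;
        if (!more_work) break;
    }
    cleanup();
    return chunks_run;
}

// Hypothetical use in a CUDA program (launch_next_chunk is a placeholder for
// your own kernel launch; it is not a CUDA API):
//   install_stop_handlers();
//   run_until_stopped(
//       [&] { launch_next_chunk(); cudaDeviceSynchronize(); return more_left; },
//       []  { cudaDeviceReset(); });
```

Note the limitation: the flag is only checked between chunks, so this helps when the work is split into kernels that each return in reasonable time. A single kernel that never returns would still block the host in the synchronize call.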

So, does anyone have a definite answer about how to cleanly kill a CUDA application, having tested their solution?

venetis · September 24, 2016, 7:02am

Just a bump. Anyone on this? It is really a big issue for me.

Thanks!

Jimmy_Pettersson · September 24, 2016, 12:11pm

Just hitting Ctrl-C should cleanly kill your remote CUDA application.

If that doesn’t work, I wonder whether your problem has to do with bugs in the application itself. Buggy kernels (with, for example, silly infinite loops) have certainly frozen up my system from time to time.

If you declare your own SIGINT handler, does it get called?

Is the process listed as “defunct”?

venetis · September 26, 2016, 1:48pm

Unfortunately, it doesn’t kill the application. I am also certain that the system crashes are not due to a bug in the kernel I run, as I made the following test: I ran the kernel with some input and let it finish. Then I ran it again with the same input, and this time the system crashed when I hit Ctrl-C. Besides, given the nature of the code and the input, it is very difficult to have an infinite loop in the kernel. The input is simply a set of initial values, final values and increments for each loop in the kernel, so inspecting the input is enough to see that there is no infinite loop. In any case, I also printed out these values from within the kernel for a small test case, and everything is fine.

When the system doesn’t immediately crash after pressing Ctrl-C, then yes, the process gets listed as defunct in ps.

I haven’t installed a signal handler yet. I wanted to know whether this method actually works before investing time in implementing it, given the issues I have read about. It is not easy for me to try this, as I would need someone constantly near the machine to reboot it until I get it working correctly. So I first wanted to know whether, in theory, this is the recommended and actually working method.

LongY · September 26, 2016, 11:39pm

I am not sure if I can answer your question, but at least you can give this a try. I faced a similar issue with system crashes. Here is the link:
https://devtalk.nvidia.com/default/topic/902326/cudamemcpy-takes-more-than-2-seconds-then-driver-crashed-/

I notice that your application reads some input. In my application, the system crash was due to using CPU memory mapping when reading the input. After I changed it to fread(), the issue was gone.
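For reference, a plain fread() read of the kind mentioned above might look roughly like this. It is only a sketch: the function name `read_file_with_fread` is made up, and the original application's input code is not shown in this thread, so the chunk size and the "empty vector on failure" convention are assumptions.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Reads a whole binary file into host memory using buffered fread() calls
// instead of memory-mapping the file. Returns an empty vector if the file
// cannot be opened or a read error occurs.
std::vector<unsigned char> read_file_with_fread(const std::string& path) {
    std::vector<unsigned char> data;
    std::FILE* f = std::fopen(path.c_str(), "rb");
    if (!f) return data;

    unsigned char buf[4096];
    std::size_t n;
    // fread returns the number of bytes actually read; 0 means EOF or error.
    while ((n = std::fread(buf, 1, sizeof buf, f)) > 0) {
        data.insert(data.end(), buf, buf + n);
    }
    if (std::ferror(f)) data.clear();  // distinguish a read error from EOF
    std::fclose(f);
    return data;
}
```

The resulting buffer is ordinary heap memory, so it can then be handed to cudaMemcpy() like any other host array.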

venetis · September 30, 2016, 12:32pm

It seems that the problem was with the driver of the card. We upgraded to the latest driver (352.99) and the problem seems to have disappeared. Pressing Ctrl-C or using kill -9 now terminates the application, and we haven’t seen any system crashes (so far).
