I have an application that runs on a GPU for hours or even days, depending on the input. The input is given to me by other people, and sometimes it contains erroneous values due to carelessness. When that happens, I need to kill the application and start it again with the correct input. During development I also need to kill the application because of errors in the kernel (infinite loops, etc.).
Hitting Ctrl-C or sending a kill signal to the application does not seem to work. Typically, the server hosting the GPU gets stuck. Sometimes it doesn’t get stuck, but the application becomes a zombie and the kernel continues to run on the GPU. Rebooting the machine from the command line (either before or after hitting Ctrl-C) also doesn’t work every time. In fact, it fails most of the time, because the machine gets stuck again. Since I work remotely on that machine, this is a big issue for me, as I have to ask someone to physically reboot the machine. And if it is afternoon, I have to wait until the next morning…
So, the question is how to handle this. I have looked around, and some people suggest installing a signal handler that calls cudaDeviceReset() and then ends the application by calling exit(). However, there doesn’t seem to be a conclusive answer in any thread on any forum about whether this works reliably.
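For reference, here is a minimal sketch of one variant of that signal-handler idea. It is not a tested fix for the hang described above. It assumes the work can be split into many short kernel launches, and the names install_stop_handlers, run_chunks and shutdown_gpu are made up for illustration. Instead of calling cudaDeviceReset() (the CUDA runtime call that resets the device) inside the handler, the handler only sets a flag, since CUDA runtime calls are not documented as safe to make from a signal handler. The host loop polls the flag between launches and does the cleanup itself. The CUDA calls are guarded by __CUDACC__ so the file also builds with a plain C++ compiler.

```cpp
#include <csignal>
#include <cstdio>
#include <cstdlib>

#ifdef __CUDACC__
#include <cuda_runtime.h>
#endif

// Set by the signal handler, polled by the host loop.
// volatile sig_atomic_t is the type the standard guarantees is safe here.
static volatile std::sig_atomic_t g_stop_requested = 0;

// The handler does nothing except record that a stop was requested.
extern "C" void on_stop_signal(int) { g_stop_requested = 1; }

// Hypothetical helper: route Ctrl-C (SIGINT) and plain kill (SIGTERM)
// to the flag-setting handler. SIGKILL (kill -9) cannot be caught.
void install_stop_handlers() {
    std::signal(SIGINT, on_stop_signal);
    std::signal(SIGTERM, on_stop_signal);
}

// Hypothetical host loop: one short kernel launch per chunk (assumption).
// Returns the number of chunks completed before a stop was requested.
int run_chunks(int total_chunks) {
    int done = 0;
    while (done < total_chunks && !g_stop_requested) {
#ifdef __CUDACC__
        // Launch the kernel for chunk `done` here, then wait for it.
        cudaDeviceSynchronize();
#endif
        ++done;
    }
    return done;
}

// Cleanup performed from normal host code, not from the handler.
void shutdown_gpu() {
#ifdef __CUDACC__
    cudaDeviceSynchronize();  // wait for any kernel still in flight
    cudaDeviceReset();        // release this process's device resources
#endif
}
```

A main() would call install_stop_handlers(), then run_chunks(n), then shutdown_gpu() before returning. Note the limitation: if a single kernel runs for hours or loops forever, the flag is never checked, so this only helps when each launch is short.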
So, is there someone who has some definite answer about how to cleanly kill a CUDA application, having tested his/her solution?
Just hitting Ctrl-C should cleanly kill your remote CUDA application.
If that doesn’t work, I’m wondering whether your problem has to do with bugs in the application itself. Buggy kernels (with, for example, silly infinite loops) have certainly frozen my system from time to time.
If you declare your own SIGINT handler, does it get called?
Unfortunately it doesn’t kill the application. I am also certain that the system crashes are not due to a bug in the kernel I run, because I made the following test. I ran the kernel with some input and let it finish. Then I ran it again with the same input, but this time I hit Ctrl-C and the system crashed. Besides, given the nature of the code and the input, it is very difficult to end up with an infinite loop in the kernel. The input is simply a set of initial values, final values and increments for each loop in the kernel, so inspecting the input is enough to see that there is no infinite loop. In any case, I also printed out these values from within the kernel for a small test case and everything was fine.
When the system doesn’t immediately crash after pressing Ctrl-C then yes, the process gets listed as defunct in a ps.
I haven’t installed a signal handler yet. Given the issues I have read about, I wanted to know whether this method actually works before investing time in implementing it. It is not easy for me to experiment with this, as I would need someone constantly near the machine to reboot it until I get it working correctly. So I first wanted to know whether, in principle, this is the recommended method and whether it actually works.
I notice that your application has some input reading. In my application, the system crash was due to using CPU memory mapping when reading input. After I changed it to fread(), the issue was gone.
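To illustrate the fread() approach, here is a minimal sketch that reads a whole input file into ordinary host memory instead of memory-mapping it. The file format (raw binary floats) and the helper name read_floats are assumptions for the example, not details of either application in this thread.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical helper: read a binary file of floats with fread().
// Returns an empty vector if the file cannot be opened.
std::vector<float> read_floats(const char* path) {
    std::vector<float> values;
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return values;

    float buf[1024];
    std::size_t n;
    // fread returns the number of complete elements read; 0 means EOF or error.
    while ((n = std::fread(buf, sizeof(float), 1024, f)) > 0) {
        values.insert(values.end(), buf, buf + n);
    }
    std::fclose(f);
    return values;
}
```

The resulting vector lives in regular heap memory, so it can be passed to cudaMemcpy() like any other host buffer.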
It seems that the problem was with the driver of the card. We upgraded to the latest driver (352.99) and the problem seems to have disappeared. Pressing Ctrl-C or using kill -9 now terminates the application, and we haven’t seen any system crashes so far.