A process using CUDA gets stuck, then all others get stuck as well - what do I do?

I’m writing a program using CUDA 12.1, running on a Linux system (Devuan Daedalus, kernel version 6.1.27).

For some reason (which may be a bug of mine, although I kind of doubt it), the process gets stuck at some point. Sending it SIGINT, SIGTERM or SIGKILL has no effect. The details of what this process does shouldn’t really matter, but for the record: it doesn’t do file I/O, it doesn’t use the network, and it doesn’t use any other peripherals - it just uses CUDA APIs (specifically, execution graphs), does some computation in memory, and prints messages to its standard output.
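(For reference, one way to confirm that a process in this state is stuck in uninterruptible sleep inside the kernel, rather than merely ignoring signals, is something like the following sketch; the PID is a placeholder:)

    # "D" in the STAT column means uninterruptible sleep, which is why
    # SIGKILL appears to have no effect until the blocking call returns;
    # WCHAN hints at the kernel function it is sleeping in.
    ps -o pid,stat,wchan:32,cmd -p <PID>

    # With root, the kernel stack shows where the process is blocked
    # (on this kind of hang, typically somewhere inside the nvidia /
    # nvidia_uvm modules), if the kernel exposes /proc/<PID>/stack.
    sudo cat /proc/<PID>/stack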

So, my first question: How can I kill such a process (other than by rebooting the machine)?

Now, after this process gets stuck, any other process using CUDA APIs also seems to get stuck, (almost) immediately when it starts running.

Thus, my second question: Can I avoid other processes getting stuck?
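(A rough sketch of what one can check before launching another CUDA process, to see whether the GPU/driver is already in a bad state - the device index 0 is a placeholder, and nvidia-smi may itself hang once the driver is wedged:)

    # List the compute processes the driver still associates with the GPU;
    # if the stuck process shows up here, a new CUDA context will likely
    # hang as well.
    nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv -i 0

    # Look for Xid errors from the NVIDIA driver in the kernel log; they
    # often accompany a hung context (root may be required for dmesg).
    sudo dmesg | grep -i xid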

I have seen other similar questions; here is an example, and there are others. I don’t have any suggestions to add beyond what I have already shared. I doubt there is a precise, deterministic, guaranteed method to fix this in every imaginable case, other than a reboot. As you can see from that other thread, there may be other processes that need to be killed before the GPU will recover. Until the GPU is recovered, by reboot or some other method, I don’t know of a specific method to guarantee that other processes using that GPU will behave normally.

@Robert_Crovella : I’d settle for an imprecise, not-guaranteed method which works in some cases :-)

What I’ve tried:

  • Listed all processes using any GPU device with lsof /dev/nvidia*; tried kill -KILL on them, separately and all together.
  • Used nvidia-smi --gpu-reset -i with the relevant device ID - that can’t work even in principle, since it refuses to reset a GPU while a process is using it (see the sketch after this list).
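For completeness, here is roughly the sequence that suggestions of this kind add up to: kill everything holding the device, stop the persistence daemon if present, attempt a reset, and reload the kernel modules as a last resort. The device index 0 is a placeholder, and every step can fail if the stuck process is wedged inside the driver:

    # 1. Kill everything that still has a /dev/nvidia* node open, and stop
    #    the persistence daemon if it is installed.
    sudo fuser -kv /dev/nvidia*
    sudo service nvidia-persistenced stop

    # 2. Try a GPU reset; this refuses to run while any process still uses
    #    the device, which is exactly the problem here.
    sudo nvidia-smi --gpu-reset -i 0

    # 3. As a last resort before rebooting, unload and reload the kernel
    #    modules; rmmod fails if the stuck process (or a display server)
    #    still holds references, which again leaves only a reboot.
    sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
    sudo modprobe nvidia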

I don’t know of a method that works in every case, so I cannot possibly tell you a method that will work in your case (other than a reboot).

Looking at the other thread I linked, anecdotally:

  • most of the suggestions have to do with killing processes
  • there is at least one report of success, possibly several

Between that thread and the SO thread it links (which seems to be mostly duplicative), there is a laundry list of things to try. If you’ve tried all of those, then I don’t know what to do in your case and have no further suggestions.

So, I think I’ll try filing a bug. The NVIDIA kernel modules should not lock up like this (assuming one doesn’t do something underhanded like manipulating kernel memory space etc.), and processes should remain killable.
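(If it helps: the driver ships an nvidia-bug-report.sh script that collects the information NVIDIA asks for with such reports. It may itself stall while querying a wedged GPU, so if the hang is reproducible it’s worth capturing a report from a healthy state as well.)

    # Collects driver version, kernel log, nvidia-smi output etc. into
    # nvidia-bug-report.log.gz in the current directory.
    sudo nvidia-bug-report.sh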