A process using CUDA gets stuck, then all others get stuck as well - what do I do?

I’m writing a program using CUDA 12.1, running on a Linux system (Devuan Daedalus, kernel version 6.1.27).

For some reason (which may be a bug of mine, although I kind of doubt it), the process gets stuck at some point. Sending it SIGINT, SIGTERM or SIGKILL has no effect. The details of what this process does shouldn’t really matter, but for the record: it doesn’t do file I/O, it doesn’t use the network, and it doesn’t use any other peripherals - it just uses CUDA APIs (specifically, execution graphs), does some computation in memory, and prints messages to its standard output.
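(For reference, one way to confirm that a process in this state is stuck in uninterruptible sleep inside the kernel, rather than merely ignoring signals, is something like the following sketch; the PID is a placeholder:)

    # "D" in the STAT column means uninterruptible sleep, which is why
    # SIGKILL appears to have no effect until the blocking call returns;
    # WCHAN hints at the kernel function it is sleeping in.
    ps -o pid,stat,wchan:32,cmd -p <PID>

    # With root, the kernel stack shows where the process is blocked
    # (on this kind of hang, typically somewhere inside the nvidia /
    # nvidia_uvm modules), if the kernel exposes /proc/<PID>/stack.
    sudo cat /proc/<PID>/stack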

So, my first question: How can I kill such a process (other than by rebooting the machine)?

Now, after this process gets stuck, any other process using CUDA APIs also seems to get stuck, (almost) immediately when it starts running.

Thus, my second question: Can I avoid other processes getting stuck?
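(A rough sketch of what one can check before launching another CUDA process, to see whether the GPU/driver is already in a bad state - the device index 0 is a placeholder, and nvidia-smi may itself hang once the driver is wedged:)

    # List the compute processes the driver still associates with the GPU;
    # if the stuck process shows up here, a new CUDA context will likely
    # hang as well.
    nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv -i 0

    # Look for Xid errors from the NVIDIA driver in the kernel log; they
    # often accompany a hung context (root may be required for dmesg).
    sudo dmesg | grep -i xid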

I have seen other similar questions; here is an example, and there are others. I don’t have any suggestions to add beyond what I have already shared. I doubt there is a precise, deterministic, guaranteed method to fix this in every imaginable case, other than a reboot. As you can see from that other thread, there may be other processes that need to be killed before the GPU will recover. Until the GPU is recovered, by reboot or some other method, I don’t know of a specific method to guarantee that other processes using that GPU will behave normally.

@Robert_Crovella : I’d settle for an imprecise, not-guaranteed method which works in some cases :-)

What I’ve tried:

  • Listed all processes using any GPU device with lsof /dev/nvidia*; tried kill -KILL on them, separately and all together.
  • Used nvidia-smi --gpu-reset -i with the relevant device ID - that can’t work even in principle, since it refuses to reset a GPU while a process is using it (see the sketch after this list).
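For completeness, here is roughly the sequence that suggestions of this kind add up to: kill everything holding the device, stop the persistence daemon if present, attempt a reset, and reload the kernel modules as a last resort. The device index 0 is a placeholder, and every step can fail if the stuck process is wedged inside the driver:

    # 1. Kill everything that still has a /dev/nvidia* node open, and stop
    #    the persistence daemon if it is installed.
    sudo fuser -kv /dev/nvidia*
    sudo service nvidia-persistenced stop

    # 2. Try a GPU reset; this refuses to run while any process still uses
    #    the device, which is exactly the problem here.
    sudo nvidia-smi --gpu-reset -i 0

    # 3. As a last resort before rebooting, unload and reload the kernel
    #    modules; rmmod fails if the stuck process (or a display server)
    #    still holds references, which again leaves only a reboot.
    sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
    sudo modprobe nvidia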

I don’t know of a method that works in every case, so I cannot possibly tell you a method that will work in your case (other than a reboot).

Looking at the other thread I linked, anecdotally:

  • most of the suggestions have to do with killing processes
  • there is at least one report of success, possibly several

Between that thread and the SO thread it links (which seems to be mostly duplicative), there is a laundry list of things to try. If you’ve tried all of those, then I don’t know what to do in your case and have no further suggestions.

So, I think I’ll try filing a bug. The NVIDIA kernel modules should not lock up like this (assuming one doesn’t do something underhanded like manipulating kernel memory space etc.), and processes should remain killable.
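(If it helps: the driver ships an nvidia-bug-report.sh script that collects the information NVIDIA asks for with such reports. It may itself stall while querying a wedged GPU, so if the hang is reproducible it’s worth capturing a report from a healthy state as well.)

    # Collects driver version, kernel log, nvidia-smi output etc. into
    # nvidia-bug-report.log.gz in the current directory.
    sudo nvidia-bug-report.sh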