Unkillable PyTorch CUDA 12.9 NVIDIA Driver 580 Process on Ubuntu 24.04

I’m using a GeForce RTX 4070 Ti Super in an Ubuntu 24.04 Linux box (kernel 6.14.0-29-generic) to test out some small HuggingFace models. Since I log in to the box remotely, it’s headless and I don’t have any type of desktop or X server processes running. I’m running the models in Python 3.11 with CUDA 12.9:

smaug-~> nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Tue_May_27_02:21:03_PDT_2025
Cuda compilation tools, release 12.9, V12.9.86
Build cuda_12.9.r12.9/compiler.36037853_0

I’m using PyTorch built for CUDA 12.9, installed with: pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu129

After a fresh reboot, my nvidia-smi output is:

(base) smaug-~> sudo nvidia-smi
Fri Sep  5 18:23:52 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 ...    Off |   00000000:01:00.0 Off |                  N/A |
| 32%   31C    P0             36W /  285W |       0MiB /  16376MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

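I was initially concerned that nvidia-smi and nvcc were reporting different CUDA versions, but it seems like that’s OK: nvidia-smi shows the highest CUDA version the driver supports, while nvcc shows the version of the installed toolkit.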
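A quick way to sanity-check the two version numbers is to compare them programmatically. The sketch below assumes the usual rule that a toolkit works as long as its version does not exceed the "CUDA Version" shown in the nvidia-smi header. The function names are hypothetical helpers, and the parsing only targets the header and nvcc formats pasted above.

```python
import re


def parse_smi_header(text):
    """Return (driver_version, max_cuda_version) from nvidia-smi header text."""
    m = re.search(r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", text)
    if m is None:
        raise ValueError("no nvidia-smi header line found")
    return m.group(1), m.group(2)


def parse_nvcc_release(text):
    """Return the toolkit release (e.g. '12.9') from `nvcc -V` output."""
    m = re.search(r"release\s+(\d+\.\d+)", text)
    if m is None:
        raise ValueError("no nvcc release line found")
    return m.group(1)


def as_tuple(version):
    """Turn '12.9' into (12, 9) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def toolkit_supported(smi_text, nvcc_text):
    """Assumed rule: toolkit version must not exceed the driver's CUDA version."""
    _, driver_cuda = parse_smi_header(smi_text)
    toolkit = parse_nvcc_release(nvcc_text)
    return as_tuple(driver_cuda) >= as_tuple(toolkit)
```

With the output shown in this post (driver reports 13.0, nvcc reports 12.9), toolkit_supported returns True, so the mismatch by itself should not be the cause of the hang.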

At first, I also had some trouble resetting my GPU with nvidia-smi but, after disabling DRM, reset seems to work fine right after a clean reboot:

(base) smaug-~> sudo nvidia-smi --gpu-reset
GPU 00000000:01:00.0 was successfully reset.
All done.

My problem occurs sporadically when I run a jupyter notebook cell containing my HuggingFace text classifier on a modestly sized data set (400 samples each consisting of under 250 words). The cell that calls the gpu will hang. This is annoying but would be workable except that I cannot seem to kill the python process stuck interacting with the gpu or reset the gpu after this occurs. Specifically:

  • I shut down the jupyter server and the jupyter python kernel closes.

  • top reveals a python process still running and using up 100% of the CPU and that process shrugs off kill -9:

(base) smaug-~> ps aux | grep python
gmessier    2762 35.6  0.0      0     0 ?        Rs   18:26   3:59 [python]
gmessier    2958  0.0  0.0   6548  2080 pts/0    S+   18:38   0:00 grep --color=auto python
(hf-ttrl) (base) smaug-~> sudo kill -9 2762
(hf-ttrl) (base) smaug-~> ps aux | grep python
gmessier    2762 40.7  0.0      0     0 ?        Rs   18:26   4:56 [python]
gmessier    2964  0.0  0.0   6548  2064 pts/0    S+   18:39   0:00 grep --color=auto python
  • I guessed that this is likely due to the fact that the process is interacting with the gpu on a fairly deep level but I can’t reset the gpu anymore either using nvidia-smi:
(base) smaug-~> sudo nvidia-smi --gpu-reset
The following GPUs could not be reset:
  GPU 00000000:01:00.0: Not Supported
  • nvidia-smi’s output is now:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 ...    Off |   00000000:01:00.0 N/A |                  N/A |
|ERR!  ERR! ERR!             N/A  /  N/A  |     772MiB /  16376MiB |     N/A      Default |
|                                         |                        |                 ERR! |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
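One plausible reading of the ps output above (an assumption, not something confirmed in this thread) is that the process has already accepted the kill and released its user-space memory, which would explain the zero VSZ/RSS and the bracketed [python] name, but is still running inside kernel code during teardown. A minimal sketch for flagging that pattern from a `ps aux` line follows; PsEntry and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PsEntry:
    user: str
    pid: int
    cpu: float
    vsz_kib: int
    rss_kib: int
    stat: str
    command: str


def parse_ps_aux_line(line):
    """Parse one data line of `ps aux`.

    Columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    """
    fields = line.split(None, 10)
    if len(fields) < 11:
        raise ValueError("not a ps aux data line")
    return PsEntry(
        user=fields[0],
        pid=int(fields[1]),
        cpu=float(fields[2]),
        vsz_kib=int(fields[4]),
        rss_kib=int(fields[5]),
        stat=fields[7],
        command=fields[10],
    )


def looks_stuck_in_teardown(entry):
    """Heuristic: no memory mapped, bracketed name, not reported as a zombie.

    Kernel threads also match this pattern, so check the owning user as well.
    """
    no_memory = entry.vsz_kib == 0 and entry.rss_kib == 0
    bracketed = entry.command.startswith("[") and entry.command.endswith("]")
    return no_memory and bracketed and not entry.stat.startswith("Z")
```

Applied to the two lines pasted above, the check flags PID 2762 and ignores the grep process. If this reading is right, sending further signals cannot help, because the process never returns to user space to act on them.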

I know that the GPU is supported since reset worked right after reboot before I had this zombie process running. Investigating a bit further, I’ve also observed:

  • After a clean reboot, when I start my jupyter notebook and create my classifier, nvidia-smi shows the jupyter notebook python kernel under “Processes”. I can run the classifier on very short toy examples and, as long as nothing hangs, I can reset the kernel and nvidia-smi shows “No running processes found” right after the kernel restart. That all seems normal so the problem does seem to occur only when a more serious model is run.

  • I’ve tried CUDA 13.0 and the nightly pytorch builds that support 13.0 but the problem is the same.

  • This feels like a memory leak. I’ve been very careful to truncate my input sequence lengths to something my model can handle but, even if I did input an overly long sequence, I would have hoped for some kind of runtime error that didn’t require a reboot.
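To narrow down where a hang like this starts, it can help to feed the samples through in small batches and log each batch before it runs, so the last line printed identifies the batch that never returned. The sketch below is only an illustration: `classify` is a hypothetical stand-in for the real model call, the batch size of 16 is arbitrary, and truncating by words is just a rough proxy for the token-level truncation a tokenizer performs.

```python
def truncate_words(text, max_words):
    """Keep at most max_words whitespace-separated words."""
    return " ".join(text.split()[:max_words])


def batched(samples, batch_size):
    """Yield consecutive slices of samples with at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


def run_in_batches(samples, classify, batch_size=16, max_words=250):
    """Run classify on small truncated batches, logging before each call."""
    results = []
    for index, batch in enumerate(batched(samples, batch_size)):
        batch = [truncate_words(text, max_words) for text in batch]
        # flush=True so the line is visible even if the next call hangs
        print(f"batch {index}: {len(batch)} samples", flush=True)
        results.extend(classify(batch))
    return results
```

For the 400-sample data set described above, a batch size of 16 gives 25 batches, which is fine-grained enough to see whether the hang is tied to particular inputs or happens at random.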

Any tips or hints would be very much appreciated, and apologies if I’ve missed something obvious!

I have a very similar issue with my RTX 5070 Ti on Ubuntu 25.10.

PyTorch model training hangs and the process cannot be killed, even with kill -9.

Sometimes this causes a kernel panic (Caps Lock LED blinking) and I have to hard reboot the computer.

The nvidia-smi output below was captured after a reboot; during training, GPU utilization and memory usage are both around 90%.

The attached nvidia-bug-report.log can hopefully help diagnose the issue.

torch==2.8.0

$ nvidia-smi
Wed Oct 29 20:52:32 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5070 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   31C    P4             17W /   65W |      14MiB /  12227MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            3465      G   /usr/bin/gnome-shell                      2MiB |
+-----------------------------------------------------------------------------------------+
denis@denis-vector:~/src$ uname -a
Linux denis-vector 6.17.0-6-generic #6-Ubuntu SMP PREEMPT_DYNAMIC Tue Oct 7 13:34:17 UTC 2025 x86_64 GNU/Linux

nvidia-bug-report.log.gz (494.8 KB)

As an update, I’m still working through this but I’m starting to suspect that my problem is hardware and power connection related.

After struggling with the 4070 Ti, I switched it out for an old TitanX and all the problems went away. That made me think that the 4070 Ti was bad and just this week I upgraded to a 5070 Ti (same card as you).

The 5070 Ti wants three PCIe power connectors. My power supply is a Corsair 860W with 6 PCIe outputs so it should be fine to run the new card. I thought that I had two cables and only needed to buy one new one, but I’ve now realized that my current cable is just a single PCIe power cable with two connectors on a splitter.

My 4070 Ti wanted two PCIe power connections and I connected both of them to that splitter. I’m now starting to suspect that I was starving it for power since I was really only running it off one PCIe output from my power supply.
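The suspicion can be checked with rough arithmetic. The sketch below assumes the nominal PCIe ratings of 75 W from the slot and 150 W per 8-pin cable (neither figure is stated in this thread, and real cables often tolerate more), and it uses the 285 W power cap that nvidia-smi reported for the 4070 Ti earlier. The function names are hypothetical.

```python
SLOT_W = 75        # assumed nominal power available from a PCIe x16 slot
EIGHT_PIN_W = 150  # assumed nominal rating of one 8-pin PCIe power cable


def available_power_w(independent_cables, slot_w=SLOT_W, per_cable_w=EIGHT_PIN_W):
    """Nominal power budget: slot plus each cable that has its own PSU output."""
    return slot_w + independent_cables * per_cable_w


def headroom_w(board_power_w, independent_cables):
    """Positive means the nominal budget covers the card's power cap."""
    return available_power_w(independent_cables) - board_power_w


# Both plugs of a splitter share one PSU output, so they count as ONE cable.
print(headroom_w(285, independent_cables=1))  # split cable
print(headroom_w(285, independent_cables=2))  # two dedicated cables
```

Under these assumptions the split cable leaves the card 60 W short of its 285 W cap, while two dedicated cables leave 90 W of headroom, which fits the picture of the card only failing under heavier load.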

So, I’m now waiting for my new cables to arrive and will let you know if I get my 5070 Ti working. However, make sure you’re powering your card with three dedicated PCIe cables connected to three different PCIe outputs on your power supply and that your power supply is rated high enough.

Confirmed that it was insufficient power to the GPU because of that split PCIe cable (sigh). My 5070 Ti is working fine with:

  • Ubuntu 24.04
  • Drivers installed using: ubuntu-drivers install --gpgpu nvidia:580-server-open
  • PyTorch installed using: pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu130
  • Three dedicated and fully separate PCIe power connections.