I’m using a GeForce RTX 4070 Ti Super in an Ubuntu 24.04 Linux box (kernel 6.14.0-29-generic) to test out some small HuggingFace models. Since I log in to the box remotely, it’s headless and I don’t have any kind of desktop or X server processes running. I’m running the models in Python 3.11 with CUDA 12.9:
smaug-~> nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Tue_May_27_02:21:03_PDT_2025
Cuda compilation tools, release 12.9, V12.9.86
Build cuda_12.9.r12.9/compiler.36037853_0
I’m using the PyTorch build for CUDA 12.9:
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu129
After a fresh reboot, my nvidia-smi output is:
(base) smaug-~> sudo nvidia-smi
Fri Sep 5 18:23:52 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07 Driver Version: 580.82.07 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4070 ... Off | 00000000:01:00.0 Off | N/A |
| 32% 31C P0 36W / 285W | 0MiB / 16376MiB | 2% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
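I was initially concerned that nvidia-smi and nvcc were reporting different CUDA versions, but it seems like that’s OK: as I understand it, nvidia-smi reports the newest CUDA version the driver supports, while nvcc reports the version of the installed toolkit.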
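The version check can be sketched as a small script. This is only an illustration: the function names are hypothetical, the two input strings are copied from the outputs above, and the only rule it encodes is that the driver's supported CUDA version should be at least the toolkit's version.

```python
import re


def parse_cuda_version(text):
    """Return (major, minor) from 'CUDA Version: X.Y' (nvidia-smi) or 'release X.Y' (nvcc)."""
    match = re.search(r"(?:CUDA Version:\s*|release\s+)(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no CUDA version found in text")
    return int(match.group(1)), int(match.group(2))


def driver_supports_toolkit(smi_text, nvcc_text):
    """True when the driver's maximum CUDA version is >= the toolkit version."""
    return parse_cuda_version(smi_text) >= parse_cuda_version(nvcc_text)


# Lines copied from the nvidia-smi and nvcc output in this post.
smi_line = "| NVIDIA-SMI 580.82.07 Driver Version: 580.82.07 CUDA Version: 13.0 |"
nvcc_line = "Cuda compilation tools, release 12.9, V12.9.86"

print(parse_cuda_version(smi_line))   # driver side
print(parse_cuda_version(nvcc_line))  # toolkit side
print(driver_supports_toolkit(smi_line, nvcc_line))
```

Comparing tuples rather than floats keeps a version like 12.10 ordered correctly after 12.9.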
At first, I also had some trouble resetting my GPU with nvidia-smi, but after disabling DRM, reset seems to work fine right after a clean reboot:
(base) smaug-~> sudo nvidia-smi --gpu-reset
GPU 00000000:01:00.0 was successfully reset.
All done.
My problem occurs sporadically when I run a Jupyter notebook cell containing my HuggingFace text classifier on a modestly sized data set (400 samples, each under 250 words). The cell that calls the GPU will hang. This is annoying but would be workable, except that I cannot seem to kill the Python process stuck interacting with the GPU or reset the GPU after this occurs. Specifically:
- I shut down the jupyter server and the jupyter python kernel closes.
- top reveals a python process still running and using up 100% of the CPU and that process shrugs off kill -9:
(base) smaug-~> ps aux | grep python
gmessier 2762 35.6 0.0 0 0 ? Rs 18:26 3:59 [python]
gmessier 2958 0.0 0.0 6548 2080 pts/0 S+ 18:38 0:00 grep --color=auto python
(hf-ttrl) (base) smaug-~> sudo kill -9 2762
(hf-ttrl) (base) smaug-~> ps aux | grep python
gmessier 2762 40.7 0.0 0 0 ? Rs 18:26 4:56 [python]
gmessier 2964 0.0 0.0 6548 2064 pts/0 S+ 18:39 0:00 grep --color=auto python
- I’m guessing this is because the process is interacting with the GPU at a fairly deep level, but I can no longer reset the GPU with nvidia-smi either:
(base) smaug-~> sudo nvidia-smi --gpu-reset
The following GPUs could not be reset:
GPU 00000000:01:00.0: Not Supported
- nvidia-smi’s output is now:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07 Driver Version: 580.82.07 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4070 ... Off | 00000000:01:00.0 N/A | N/A |
|ERR! ERR! ERR! N/A / N/A | 772MiB / 16376MiB | N/A Default |
| | | ERR! |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
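For the stuck process in the ps output above, it may help to read its state directly from /proc. In that output the process is in state R with a VSZ and RSS of 0, and ps prints the name in brackets, which it does when it cannot read the command line. That may suggest the process has already released its user-space memory and is spinning in kernel code, where SIGKILL would not take effect until it returns. A minimal sketch, assuming the Linux /proc/<pid>/stat layout (pid, command name in parentheses, then a one-letter state); the helper names are hypothetical:

```python
# One-letter process states as documented in proc(5).
STATE_NAMES = {
    "R": "running",
    "S": "sleeping (interruptible)",
    "D": "uninterruptible sleep",
    "Z": "zombie",
    "T": "stopped",
}


def parse_proc_stat(stat_line):
    """Extract pid, command name and state from one /proc/<pid>/stat line."""
    # The command name may itself contain spaces or parentheses,
    # so split on the LAST ')' rather than on whitespace.
    head, _, tail = stat_line.rpartition(")")
    pid_text, _, comm = head.partition("(")
    state = tail.split()[0]
    return {
        "pid": int(pid_text),
        "comm": comm,
        "state": state,
        "state_name": STATE_NAMES.get(state, "other"),
    }


def read_process_state(pid):
    """Read and parse /proc/<pid>/stat for a live process (Linux only)."""
    with open(f"/proc/{pid}/stat") as stat_file:
        return parse_proc_stat(stat_file.read())


# Illustrative line shaped like the stuck process above (the fields
# after the state are made up for the example).
sample = "2762 (python) R 1 2762 2762 0 -1 4194560 0 0 0 0"
print(parse_proc_stat(sample))
```

On the affected machine, `read_process_state(2762)` would report the live state, and a process sitting in D or spinning in R inside a driver call would explain why kill -9 appears to do nothing.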
I know that reset is supported on this GPU, since it worked right after reboot, before I had this zombie process running. Investigating a bit further, I’ve also observed:
- After a clean reboot, when I start my jupyter notebook and create my classifier, nvidia-smi shows the jupyter notebook python kernel under “Processes”. I can run the classifier on very short toy examples and, as long as nothing hangs, I can reset the kernel and nvidia-smi shows “No running processes found” right after the kernel restart. That all seems normal so the problem does seem to occur only when a more serious model is run.
- I’ve tried CUDA 13.0 and the nightly pytorch builds that support 13.0 but the problem is the same.
- This feels like a memory leak. I’ve been very careful to truncate my input sequence lengths to something my model can handle but, even if I did input an overly long sequence, I would have hoped for some kind of runtime error that didn’t require a reboot.
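One way to at least turn a hang into a detectable error is to run the GPU work in a separate Python process with a timeout, so the notebook kernel itself never blocks. This is a sketch under assumptions, not a fix: `run_with_watchdog` is a hypothetical helper, the code string stands in for the real classifier script, and if the child is stuck inside the driver as described above, the kill may still have no visible effect.

```python
import subprocess
import sys


def run_with_watchdog(code, timeout_s):
    """Run `code` in a child Python process.

    Returns the child's exit code, or None if it exceeded timeout_s.
    """
    proc = subprocess.Popen([sys.executable, "-c", code])
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL to the child
        try:
            # Give the child a moment to exit, but do not block forever:
            # a process stuck in kernel code may never go away.
            proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            pass
        return None


if __name__ == "__main__":
    # Stand-in workloads; a real run would launch the classifier script.
    print(run_with_watchdog("print('finished')", timeout_s=30))
    print(run_with_watchdog("import time; time.sleep(60)", timeout_s=1))
```

`Popen.wait` with a timeout is used instead of `subprocess.run(..., timeout=...)` because `run` waits for the child again after killing it, which would block the caller if the child cannot be killed.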
Any tips or hints would be very much appreciated, and apologies if I’ve missed something obvious!