I have a very similar issue with my RTX 5070 Ti on Ubuntu 25.10.
PyTorch model training hangs and the process cannot be killed, even with kill -9.
Sometimes this escalates to a kernel panic (Caps Lock LED blinking) and I have to hard-reboot the machine.
The nvidia-smi output below was captured after a reboot; during training, GPU utilization and memory usage are both around 90%.
I've attached nvidia-bug-report.log.gz in the hope it helps diagnose the issue.
torch==2.8.0
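Not a fix, but in case it helps narrow down where the hang begins: while the process still responds to signals (before it becomes unkillable), a stdlib-only faulthandler hook in the training script lets you dump the Python stacks of a stuck run on demand. A minimal sketch — nothing here is specific to my model, and the choice of SIGUSR1 is arbitrary:

```python
import faulthandler
import signal
import tempfile

# Register a handler so `kill -USR1 <training_pid>` dumps every Python
# thread's stack to stderr -- shows where the training loop is stuck
# while the process can still be signalled.
faulthandler.register(signal.SIGUSR1)

# Demonstrate what a dump looks like (faulthandler writes to a real
# file descriptor, so io.StringIO won't work here):
with tempfile.TemporaryFile(mode="w+") as f:
    faulthandler.dump_traceback(file=f)
    f.seek(0)
    dump_text = f.read()

print(dump_text.splitlines()[0])  # e.g. "Current thread 0x... (most recent call first):"
```

If the dump shows every thread blocked inside a CUDA call, that at least points at the driver rather than the Python side.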
$ nvidia-smi
Wed Oct 29 20:52:32 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5070 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   31C    P4             17W /  65W  |      14MiB /  12227MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            3465      G   /usr/bin/gnome-shell                      2MiB |
+-----------------------------------------------------------------------------------------+
denis@denis-vector:~/src$ uname -a
Linux denis-vector 6.17.0-6-generic #6-Ubuntu SMP PREEMPT_DYNAMIC Tue Oct 7 13:34:17 UTC 2025 x86_64 GNU/Linux
nvidia-bug-report.log.gz (494.8 KB)