Please provide the following info (tick the boxes after creating this topic):
Submission Type
Bug or Error
Feature Request
Documentation Issue
Question
Other
Workbench Version
Desktop App v0.44.8
CLI v0.21.3
Other
Host Machine operating system and location
Local Windows 11
Local Windows 10
Local macOS
Local Ubuntu 22.04
Remote Ubuntu 22.04
Other
Hello, I am a startup founder working on deep learning models. As of the last few weeks,
I have experienced frequent freezes during training. The system freezes and usually cannot be recovered, even with REISUB, so I have to
shut the computer down manually. Rarely, a freeze even occurs a few seconds after booting (not reproducible).
I need to deliver an updated deep learning model to a customer, so the personal pressure is immense. Thanks for your support…
Since I have successfully trained models on this machine before, I suspect an issue with the software packages, the Ubuntu version, or the Nvidia drivers
(installed from ppa:graphics-drivers).
When the issues first arose, I tried all available Nvidia drivers without success (exception: I never managed to get the nouveau driver working).
I then decided to move from Ubuntu 22.04 LTS to 23.10. Around that time,
there was a widespread issue with the mutter package (version ~46), which I suspected
to be at fault. While it did cause my console to slow down, resolving the mutter issue as described
in the linked report (Bug #2059847 “Input lag or freezes on Nvidia desktops with X11 a...” : Bugs : mutter package : Ubuntu) did not stop the freezes.
I have since screened numerous threads and posts, none of which resolved it either.
GPU:
I use an Nvidia RTX 3080 Ti with three monitors connected, and I dual-boot Windows and Ubuntu.
Currently, I use nvidia-driver-535 with nvidia-firmware-535-535.171.04.
nvidia-smi provides:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04 Driver Version: 535.171.04 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3080 Ti Off | 00000000:01:00.0 On | N/A |
| 0% 44C P8 63W / 350W | 316MiB / 12288MiB | 20% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2070 G /usr/lib/xorg/Xorg 153MiB |
| 0 N/A N/A 2252 G /usr/bin/gnome-shell 155MiB |
+---------------------------------------------------------------------------------------+
/etc/modprobe.d/nvidia-graphics-drivers-kms.conf:
# This file was generated by nvidia-driver-535
# Set value to 0 to disable modesetting
options nvidia-drm modeset=0
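(For reference, kernel modesetting is currently disabled as generated by the driver package; enabling it would mean changing that line to

options nvidia-drm modeset=1

and rebuilding the initramfs. I mention it only because some threads discuss KMS in connection with Nvidia/Xorg stability; I have left it at 0.)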
Operating System:
Ubuntu 23.10
kernel 6.5.0-35-generic
gcc --version:
13.2.0
Deep Learning:
I use Python 3.11.4 and PyTorch, and train on CUDA. There is plenty of GPU memory left throughout training.
Logging:
I have logged extensively during training to try to pinpoint where the issue occurs. There is no
obvious trigger. Anecdotally, the freeze most often happens somewhere between loading a batch on the CPU,
pushing it to the GPU (CUDA), and backpropagation.
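The phase logging I describe above is essentially the following sketch (`load_batch`, `to_gpu` and `forward_backward` are placeholders standing in for my actual DataLoader/CUDA/backward steps); the point is that each marker is fsync'd, so the last entry in the log survives a hard freeze and narrows down the failing phase:

```python
import os
import time

def log_phase(f, phase):
    # Write and fsync immediately so the most recent phase marker
    # reaches disk even if the system freezes right afterwards.
    f.write(f"{time.strftime('%H:%M:%S')} {phase}\n")
    f.flush()
    os.fsync(f.fileno())

# Placeholder steps; in the real script these are DataLoader
# iteration, .to('cuda'), and the forward/backward passes.
def load_batch():
    return [1.0, 2.0, 3.0]

def to_gpu(batch):
    return batch

def forward_backward(batch):
    return sum(batch)

with open("train_phases.log", "w") as f:
    for step in range(3):
        log_phase(f, f"step {step}: loading batch")
        batch = load_batch()
        log_phase(f, f"step {step}: pushing to GPU")
        batch = to_gpu(batch)
        log_phase(f, f"step {step}: backprop")
        loss = forward_backward(batch)
        log_phase(f, f"step {step}: done, loss={loss}")
```

After a freeze, the last line of train_phases.log tells me which phase was in flight.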
A post suggested setting ‘Prefer Maximum Performance’ in nvidia-settings (PowerMizer),
so I typically apply that before starting the training run.
I have searched for (and corrected) a memory leak using ‘watch -n 1 free -m’. There is no
remaining leak, and swap is unused when the freezes occur, so I would rule out memory issues.
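For what it's worth, the check boils down to watching MemAvailable and SwapFree over time. A small stdlib sketch of the same parse (shown here on a hard-coded sample; in practice the text comes from /proc/meminfo) looks like:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style 'Key:  value kB' lines into a dict of kB values."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            info[key.strip()] = int(rest.split()[0])
    return info

# Hard-coded sample; in practice: text = open("/proc/meminfo").read()
sample = """MemTotal:       32658124 kB
MemAvailable:   24117248 kB
SwapTotal:       2097148 kB
SwapFree:        2097148 kB"""

mem = parse_meminfo(sample)
# A steadily shrinking MemAvailable (with SwapFree dropping only late)
# would indicate a leak; here swap is completely unused.
print(mem["MemAvailable"] // 1024, "MiB available")  # 23552 MiB available
print("swap used:", (mem["SwapTotal"] - mem["SwapFree"]) // 1024, "MiB")  # swap used: 0 MiB
```

Sampling this once per second reproduces what ‘free -m’ shows, which is how I concluded memory is not the problem.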
I use Xorg; Wayland is disabled. I have never tried it the other way around.
As somebody suggested, I also tried using a single monitor (no HDMI), but that did not help either.
dmesg --level=emerg,alert,crit,err
[ 0.150278] x86/cpu: SGX disabled by BIOS.
According to an online post, I should enable SGX in the BIOS. Could this be related?