Frequent Freezes during CUDA Training on Ubuntu 23.10

Please provide the following info (tick the boxes after creating this topic):

Submission Type
Bug or Error
Feature Request
Documentation Issue
Question
Other

Workbench Version
Desktop App v0.44.8
CLI v0.21.3
Other

Host Machine operating system and location
Local Windows 11
Local Windows 10
Local macOS
Local Ubuntu 22.04
Remote Ubuntu 22.04
Other

Hello, I am a startup founder working on deep learning models. As of the last few weeks,
I have experienced frequent freezes during training. The system freezes and usually cannot be recovered even with REISUB, so I have to
shut the computer down manually. Rarely, the freezes even occur a few seconds after booting (not reproducible).

I need to deliver an updated deep learning model to my customer, so the pressure is immense. Thanks for your support…

As I have successfully trained models on this machine before, I suspect an issue with the software packages, the Ubuntu version, or the NVIDIA drivers
(installed from ppa:graphics-drivers).

When the issues first arose, I tried all available NVIDIA drivers without success (the exception being the nouveau driver, which I never managed to get working).
I then decided to move from Ubuntu 22.04 LTS to 23.10. Around that time,
there was a widespread issue with the mutter package (version ~46), which I suspected
to be at fault. While it did cause my desktop to slow down, resolving the mutter issue as described
in the link (Bug #2059847 “Input lag or freezes on Nvidia desktops with X11 a...” : Bugs : mutter package : Ubuntu) did not stop the freezes.
I have since gone through numerous threads and posts, none of which resolved it either.

GPU:
I use an NVIDIA RTX 3080 Ti, have three monitors connected, and dual-boot Windows and Ubuntu.
Currently I use nvidia-driver-535 with nvidia-firmware-535-535.171.04.

nvidia-smi provides:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3080 Ti     Off | 00000000:01:00.0  On |                  N/A |
|  0%   44C    P8              63W / 350W |    316MiB / 12288MiB |     20%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2070      G   /usr/lib/xorg/Xorg                          153MiB |
|    0   N/A  N/A      2252      G   /usr/bin/gnome-shell                        155MiB |
+---------------------------------------------------------------------------------------+
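
For reference, a quick sanity check from inside the training environment can confirm that PyTorch sees the same driver/CUDA combination (a minimal sketch, assuming PyTorch is installed in that environment):

import torch

print(torch.__version__)               # PyTorch build used for training
print(torch.version.cuda)              # CUDA version PyTorch was compiled against
print(torch.cuda.is_available())       # True if the driver/runtime initialises
print(torch.cuda.get_device_name(0))   # should report the RTX 3080 Ti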

/etc/modprobe.d/nvidia-graphics-drivers-kms.conf:
# This file was generated by nvidia-driver-535
# Set value to 0 to disable modesetting
options nvidia-drm modeset=0
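
As a side note, whether the running kernel actually honours this setting can be checked by reading the module parameter from sysfs (a small sketch; it assumes the nvidia-drm module is loaded and typically prints Y/N):

from pathlib import Path

# N (or 0) means kernel modesetting is disabled for nvidia-drm, Y (or 1) means enabled
param = Path("/sys/module/nvidia_drm/parameters/modeset")
if param.exists():
    print("nvidia-drm modeset:", param.read_text().strip())
else:
    print("nvidia-drm module not loaded")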

Operating System:
Ubuntu 23.10
kernel 6.5.0-35-generic

gcc --version:
13.2.0

Deep Learning:
I use Python 3.11.4 and PyTorch, and train with CUDA. There is plenty of GPU memory left throughout training.
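
For illustration, this is roughly how the GPU memory headroom can be sampled from inside the training loop (a sketch with illustrative names, not my actual training code):

import torch

def log_gpu_memory(step):
    # memory held by live tensors vs. memory reserved by the caching allocator
    allocated = torch.cuda.memory_allocated() / 1024**2
    reserved = torch.cuda.memory_reserved() / 1024**2
    print(f"step {step}: allocated={allocated:.0f} MiB, reserved={reserved:.0f} MiB")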

Logging:
I have logged extensively during training to try to pinpoint where the issue occurs. There is no
obvious trigger. Anecdotally, it most frequently happens somewhere between loading data on the CPU, pushing it to the GPU (CUDA),
and backpropagation.
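
Because CUDA kernel launches are asynchronous, the last log line before a freeze does not necessarily point at the stage that actually stalls. The sketch below (illustrative names, not my real training loop) shows how synchronizing after each stage narrows this down:

import torch

def training_step(model, batch, optimizer, loss_fn, device):
    inputs, targets = batch
    inputs, targets = inputs.to(device), targets.to(device)
    torch.cuda.synchronize(); print("H2D copy done")        # data reached the GPU

    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    torch.cuda.synchronize(); print("forward done")         # forward pass finished

    optimizer.zero_grad()
    loss.backward()
    torch.cuda.synchronize(); print("backward done")        # backpropagation finished

    optimizer.step()
    torch.cuda.synchronize(); print("optimizer step done")
    return loss.item()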

One post suggested setting ‘Prefer Maximum Performance’ in nvidia-settings (PowerMizer),
so I typically apply that before starting the training program.
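
For completeness, applying that setting can be scripted before launching training (a sketch; it assumes the GpuPowerMizerMode attribute, where 1 means ‘Prefer Maximum Performance’, and a running X session):

import subprocess

# assign PowerMizer mode 1 ('Prefer Maximum Performance') on the first GPU;
# requires nvidia-settings and an active X session
subprocess.run(["nvidia-settings", "-a", "[gpu:0]/GpuPowerMizerMode=1"], check=True)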

I have searched for (and corrected) a memory leak using ‘watch -n 1 free -m’. There is no
leak anymore, and swap is unused when the freezes occur, so I would rule out host memory issues.
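
The same check can also be logged from Python for correlation with the training logs (a sketch; it assumes the psutil package, which is not part of the setup listed above):

import time
import psutil

# log host RAM and swap once per second, same idea as 'watch -n 1 free -m'
while True:
    vm, swap = psutil.virtual_memory(), psutil.swap_memory()
    print(f"used={vm.used / 1024**2:.0f} MiB  "
          f"available={vm.available / 1024**2:.0f} MiB  "
          f"swap_used={swap.used / 1024**2:.0f} MiB")
    time.sleep(1)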

I use Xorg; Wayland is disabled. I have never tried it the other way around.

As somebody suggested, I also tried using a single monitor (no HDMI). That did not help either.

dmesg --level=emerg,alert,crit,err
[ 0.150278] x86/cpu: SGX disabled by BIOS.

According to an online post, I should enable SGX in the BIOS. Could this be related?

Hi - Are you using Workbench for this training or are you doing it another way?

Hi, thanks for the response. I do not use Workbench. I have looked at it, and while it seems interesting, it appears to be unrelated; I do not plan on training my models on NVIDIA servers.
I might look into it in the future.

I use VSCode, have a local GPU, and train on a workstation. All of the environment (Git, etc.) is already set up.

max

All good.

If you aren't using Workbench, then this is the wrong forum to post in.

You could try posting here: Latest AI & Data Science/Other Products topics - NVIDIA Developer Forums