Graphics card randomly hangs while training a neural network; nvidia-smi returns errors

We used to have a working environment, but recently our GPU has started to stop responding at random while training a neural network with PyTorch Lightning and CUDA 11.8. Only a reboot of the machine lets us use the GPU again, and even then only for a couple of hours.
Is anyone having the same issue, and does anyone know how to solve it?
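
For context, the workload itself is nothing exotic. A stripped-down sketch of the kind of training script we run looks like the following (illustrative only, not our actual model; the real one is larger but uses the same Trainer setup):

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class ToyModel(pl.LightningModule):
    # Small regression model standing in for our real network.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

if __name__ == "__main__":
    data = TensorDataset(torch.randn(10_000, 128), torch.randn(10_000, 1))
    loader = DataLoader(data, batch_size=256, num_workers=4)
    trainer = pl.Trainer(accelerator="gpu", devices=1, max_epochs=50)
    trainer.fit(ToyModel(), loader)  # the GPU hangs a few hours into runs like this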

Our system:

OS: Ubuntu 22.04.2 LTS x86_64

Host: VMware Virtual Platform None

Kernel: 5.15.0-71-generic

CPU: Intel Xeon E5-2680 v4 (8) @ 2.399GHz

GPU: NVIDIA Tesla T4

GPU driver: NVIDIA 525.105.17, installed with the NVIDIA repository enabled (https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64).
We installed the server driver provided by Ubuntu (sudo apt install nvidia-driver-525-server) and forced persistence mode on the GPU (the same error occurs without persistence mode).
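
For reference, the setup steps were roughly the following (exact commands may vary slightly; the systemd unit name assumes the packaged nvidia-persistenced service):

# driver from the Ubuntu/NVIDIA packages
sudo apt install nvidia-driver-525-server
# persistence mode, either through the daemon ...
sudo systemctl enable --now nvidia-persistenced
# ... or directly on the device (legacy persistence mode)
sudo nvidia-smi -pm 1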

Full syslog output from when the error happens:


May 11 15:05:31 server_hostname nvidia-persistenced: Verbose syslog connection opened

May 11 15:05:31 server_hostname nvidia-persistenced: Now running with user ID 108 and group ID 112

May 11 15:05:31 server_hostname nvidia-persistenced: Started (874)

May 11 15:05:31 server_hostname nvidia-persistenced: device 0000:03:00.0 - registered

May 11 15:05:31 server_hostname kernel: [ 4.770958] nvidia: loading out-of-tree module taints kernel.

May 11 15:05:31 server_hostname kernel: [ 4.771022] nvidia: module license 'NVIDIA' taints kernel.

May 11 15:05:31 server_hostname kernel: [ 4.804683] nvidia: module verification failed: signature and/or required key missing - tainting kernel

May 11 15:05:31 server_hostname kernel: [ 4.842468] nvidia-nvlink: Nvlink Core is being initialized, major device number 236

May 11 15:05:31 server_hostname kernel: [ 5.019714] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 525.105.17 Tue Mar 28 22:18:37 UTC 2023

May 11 15:05:31 server_hostname kernel: [ 5.028915] [drm] [nvidia-drm] [GPU ID 0x00000300] Loading driver

May 11 15:05:31 server_hostname kernel: [ 5.032103] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:03:00.0 on minor 1

May 11 15:05:31 server_hostname kernel: [ 13.617243] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.

May 11 15:05:31 server_hostname kernel: [ 13.634578] nvidia-uvm: Loaded the UVM driver, major device number 234.

May 11 15:05:31 server_hostname kernel: [ 14.253488] audit: type=1400 audit(1683817503.080:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=814 comm="apparmor_parser"

May 11 15:05:31 server_hostname kernel: [ 14.253497] audit: type=1400 audit(1683817503.080:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=814 comm="apparmor_parser"

May 11 15:05:33 server_hostname nvidia-persistenced: device 0000:03:00.0 - persistence mode enabled.

May 11 15:05:33 server_hostname nvidia-persistenced: device 0000:03:00.0 - NUMA memory onlined.

May 11 15:05:33 server_hostname nvidia-persistenced: Local RPC services initialized

May 11 23:25:44 server_hostname kernel: [30056.771730] os_dump_stack+0xe/0x14 [nvidia]

May 11 23:25:44 server_hostname kernel: [30056.773628] _nv010932rm+0x3ab/0x430 [nvidia]

May 11 23:25:44 server_hostname kernel: [30056.775639] 00000000b589e781: ffffffffc0c9842b (_nv010932rm+0x3ab/0x430 [nvidia])

May 11 23:25:44 server_hostname kernel: [30056.777597] 00000000cb6dbac2: ffffffffc0c9842b (_nv010932rm+0x3ab/0x430 [nvidia])

May 11 23:25:44 server_hostname kernel: [30056.779569] 00000000fa658a30: ffffffffc126a117 (os_dump_stack+0xe/0x14 [nvidia])

May 11 23:25:44 server_hostname kernel: [30056.781405] 00000000b875d9a7: ffffffffc0c9842b (_nv010932rm+0x3ab/0x430 [nvidia])

May 11 23:25:44 server_hostname kernel: [30056.783301] 000000001fb19198: ffffffffc0f11631 (_nv010862rm+0x61/0x300 [nvidia])

May 11 23:25:44 server_hostname kernel: [30056.785097] 0000000008dea031: ffffffffc0f3561c (_nv040799rm+0x42c/0x5b0 [nvidia])

May 11 23:25:44 server_hostname kernel: [30056.786937] 00000000e91fd6d9: ffffffffc0ae5413 (_nv032631rm+0xc3/0x1c0 [nvidia])

May 11 23:25:44 server_hostname kernel: [30056.788858] 00000000e60cf3c6: ffffffffc0ac67b9 (_nv043082rm+0x139/0x220 [nvidia])

May 11 23:25:44 server_hostname kernel: [30056.790767] 0000000058544e53: ffffffffc3a901e0 (_nv000442rm+0xad4/0xfffffffffd7dc8f4 [nvidia])

May 11 23:25:44 server_hostname kernel: [30056.792605] 00000000dbe80647: ffffffffc0f651c1 (_nv040065rm+0x171/0x180 [nvidia])

May 11 23:25:44 server_hostname kernel: [30056.794478] 0000000060df28f0: ffffffffc3a901e0 (_nv000442rm+0xad4/0xfffffffffd7dc8f4 [nvidia])

May 11 23:25:44 server_hostname kernel: [30056.796306] 00000000d440c31e: ffffffffc0f625de (_nv041987rm+0x1ee/0x2f0 [nvidia])

May 11 23:25:44 server_hostname kernel: [30056.798087] 00000000003c5ecf: ffffffffc079f3c0 (_nv012482rm+0x550/0x5f0 [nvidia])

May 11 23:25:44 server_hostname kernel: [30056.799958] 00000000fedde9e8: ffffffffc079fa19 (_nv040196rm+0x29/0x30 [nvidia])

May 11 23:25:44 server_hostname kernel: [30056.801797] 0000000027c4702c: ffffffffc3a8ff18 (_nv000442rm+0x80c/0xfffffffffd7dc8f4 [nvidia])

May 11 23:25:44 server_hostname kernel: [30056.803636] 00000000234d19e2: ffffffffc3a8ffd0 (_nv000442rm+0x8c4/0xfffffffffd7dc8f4 [nvidia])

May 11 23:25:44 server_hostname kernel: [30056.805478] 00000000d9573e5b: ffffffffc1167517 (rm_get_gpu_numa_info+0x107/0x250 [nvidia])

May 11 23:25:44 server_hostname kernel: [30056.807150] 000000008a671bab: ffffffffc3ac3bf0 (_nv039217rm+0x90/0xfffffffffd7a84a0 [nvidia])

May 11 23:25:44 server_hostname kernel: [30056.809002] 00000000f1611511: ffffffffc06fc8a5 (nvidia_ioctl+0x315/0x830 [nvidia])

May 11 23:25:44 server_hostname kernel: [30056.810873] 0000000067fd45a9: ffffffffc070f145 (nvidia_frontend_unlocked_ioctl+0x55/0x90 [nvidia])

May 11 23:25:44 server_hostname kernel: [30056.812856] ? _nv010862rm+0x61/0x300 [nvidia]

May 11 23:25:44 server_hostname kernel: [30056.814666] ? _nv040799rm+0x42c/0x5b0 [nvidia]

May 11 23:25:44 server_hostname kernel: [30056.816449] ? _nv032631rm+0xc3/0x1c0 [nvidia]

May 11 23:25:44 server_hostname kernel: [30056.818362] ? _nv043082rm+0x139/0x220 [nvidia]

May 11 23:25:44 server_hostname kernel: [30056.820265] ? _nv040065rm+0x171/0x180 [nvidia]

May 11 23:25:44 server_hostname kernel: [30056.822039] ? _nv041987rm+0x1ee/0x2f0 [nvidia]

May 11 23:25:44 server_hostname kernel: [30056.823813] ? _nv012482rm+0x550/0x5f0 [nvidia]

May 11 23:25:44 server_hostname kernel: [30056.825671] ? _nv040196rm+0x29/0x30 [nvidia]

May 11 23:25:44 server_hostname kernel: [30056.827524] ? rm_get_gpu_numa_info+0x107/0x250 [nvidia]

May 11 23:25:44 server_hostname kernel: [30056.829146] ? nvidia_ioctl+0x315/0x830 [nvidia]

May 11 23:25:44 server_hostname kernel: [30056.830987] ? nvidia_frontend_unlocked_ioctl+0x55/0x90 [nvidia]
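
In case anyone wants the same view on their own machine, an equivalent log can be pulled with commands along these lines:

# NVIDIA-related kernel messages for the current boot
journalctl -k -b | grep -i nvidia
# Xid errors, when present, are usually the most telling lines
sudo dmesg -T | grep -i xid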

nvidia-smi returns:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  ERR!                On   | 00000000:03:00.0 Off |                 ERR! |
|ERR!  ERR! ERR!    ERR! / ERR! |    217MiB / 15360MiB |    ERR!      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     43896      C   ray::ImplicitFunc.train           212MiB |
+-----------------------------------------------------------------------------+
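
The same information can also be queried programmatically through NVML. Below is a minimal probe, a sketch assuming the nvidia-ml-py package (which wraps the API that nvidia-smi itself uses); once the card is in the state shown above, the individual queries presumably fail with NVML errors, which is what nvidia-smi renders as ERR!:

import pynvml

def probe(index: int = 0) -> None:
    # Query the basic health fields that nvidia-smi shows as ERR! above.
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        name = pynvml.nvmlDeviceGetName(handle)
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"{name}: {temp} C, {util.gpu}% util, "
              f"{mem.used // 2**20}/{mem.total // 2**20} MiB used")
    except pynvml.NVMLError as err:
        print(f"GPU {index} query failed: {err}")
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    probe()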