Random Xid 62 error on ML workloads - Titan RTX

CPU: Intel Core i9-9900X @ 20x 4.5GHz
GRAPHICS: TITAN RTX

Driver Version: 440.44
CUDA Version: 10.1.105

Linux 5.3.0-7629-generic #31~1581628854~18.04~2db8a7a-Ubuntu SMP Fri Feb 14 19:56:05 UTC x86_64 GNU/Linux

Hello, at our university we have a computer shared by several researchers. After about a year of use, the GPU started to hang when running some processes.

The GPU throws an Xid 62 error, after which some processes continue to work normally. However, running nvidia-smi or any script that interacts with the GPU hangs, taking a long time to execute (or failing) while using 100% of one CPU thread in kernel mode.

# nvidia-smi after Xid 62
Thu Jul  2 16:10:10 2020                                                       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN RTX           Off  | 00000000:65:00.0 Off |                  N/A |
|ERR!   53C    P2   ERR! / 280W |   9900MiB / 24217MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1980      G   /usr/bin/gnome-shell                           6MiB |
|    0      5428      C   python                                      1149MiB |
|    0     32009      C   ...as/anaconda3/envs/eg/bin/python          8733MiB |
+-----------------------------------------------------------------------------+
# /var/log/kern.log (kept the NMI backtrace from the nvidia bug report tool in case it's helpful)
Jul  2 16:10:09 ilu-server kernel: [2264573.859627] NVRM: Xid (PCI:0000:65:00): 62, pid=32009, 203c(3090) 00000000 00000000
Jul  2 16:29:42 ilu-server kernel: [2265746.751325] sysrq: Show backtrace of all active CPUs
Jul  2 16:29:42 ilu-server kernel: [2265746.751330] NMI backtrace for cpu 3
Jul  2 16:29:42 ilu-server kernel: [2265746.751333] CPU: 3 PID: 9104 Comm: nvidia-bug-repo Tainted: P           OE     5.3.0-7629-generic #31~1581628854~18.04~2db8a7a-Ubuntu
Jul  2 16:29:42 ilu-server kernel: [2265746.751335] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X299 Extreme4, BIOS P1.50 10/22/2018
Jul  2 16:29:42 ilu-server kernel: [2265746.751335] Call Trace:
Jul  2 16:29:42 ilu-server kernel: [2265746.751345]  dump_stack+0x6d/0x95
Jul  2 16:29:42 ilu-server kernel: [2265746.751348]  nmi_cpu_backtrace+0x94/0xa0
Jul  2 16:29:42 ilu-server kernel: [2265746.751351]  ? lapic_can_unplug_cpu+0xb0/0xb0
Jul  2 16:29:42 ilu-server kernel: [2265746.751354]  nmi_trigger_cpumask_backtrace+0xe7/0x130
Jul  2 16:29:42 ilu-server kernel: [2265746.751356]  arch_trigger_cpumask_backtrace+0x19/0x20
Jul  2 16:29:42 ilu-server kernel: [2265746.751359]  sysrq_handle_showallcpus+0x17/0x20
Jul  2 16:29:42 ilu-server kernel: [2265746.751361]  __handle_sysrq+0xa6/0x170
Jul  2 16:29:42 ilu-server kernel: [2265746.751363]  write_sysrq_trigger+0x34/0x40
Jul  2 16:29:42 ilu-server kernel: [2265746.751367]  proc_reg_write+0x3e/0x60
Jul  2 16:29:42 ilu-server kernel: [2265746.751370]  __vfs_write+0x1b/0x40
Jul  2 16:29:42 ilu-server kernel: [2265746.751371]  vfs_write+0xb1/0x1a0
Jul  2 16:29:42 ilu-server kernel: [2265746.751374]  ksys_write+0xa7/0xe0
Jul  2 16:29:42 ilu-server kernel: [2265746.751377]  __x64_sys_write+0x1a/0x20
Jul  2 16:29:42 ilu-server kernel: [2265746.751381]  do_syscall_64+0x5a/0x130
Jul  2 16:29:42 ilu-server kernel: [2265746.751385]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul  2 16:29:42 ilu-server kernel: [2265746.751387] RIP: 0033:0x7f3afa35b154
Jul  2 16:29:42 ilu-server kernel: [2265746.751390] Code: 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 8d 05 b1 07 2e 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 f3 c3 66 90 41 54 55 49 89 d4 53 48 89 f5
Jul  2 16:29:42 ilu-server kernel: [2265746.751391] RSP: 002b:00007ffe84c74488 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
Jul  2 16:29:42 ilu-server kernel: [2265746.751393] RAX: ffffffffffffffda RBX: 0000559c8938ad40 RCX: 00007f3afa35b154
Jul  2 16:29:42 ilu-server kernel: [2265746.751394] RDX: 0000000000000002 RSI: 0000559c8938ad40 RDI: 0000000000000001
Jul  2 16:29:42 ilu-server kernel: [2265746.751395] RBP: 0000000000000002 R08: 0000559c8938e394 R09: 0000559c8938e082
Jul  2 16:29:42 ilu-server kernel: [2265746.751397] R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000001
Jul  2 16:29:42 ilu-server kernel: [2265746.751398] R13: 0000000000000002 R14: 7fffffffffffffff R15: 0000559c893857b5
Jul  2 16:29:42 ilu-server kernel: [2265746.751402] Sending NMI from CPU 3 to CPUs 0-2,4-19:
Jul  2 16:29:42 ilu-server kernel: [2265746.751414] NMI backtrace for cpu 0
Jul  2 16:29:42 ilu-server kernel: [2265746.751415] CPU: 0 PID: 32009 Comm: python Tainted: P           OE     5.3.0-7629-generic #31~1581628854~18.04~2db8a7a-Ubuntu
Jul  2 16:29:42 ilu-server kernel: [2265746.751415] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X299 Extreme4, BIOS P1.50 10/22/2018
Jul  2 16:29:42 ilu-server kernel: [2265746.751415] RIP: 0033:0x5617633b751a
Jul  2 16:29:42 ilu-server kernel: [2265746.751416] Code: 24 48 4c 89 5c 24 08 85 ff 0f 89 81 44 00 00 48 89 6c 24 08 49 89 f1 4c 8b 7b 48 c6 83 84 00 00 00 01 48 c7 43 48 00 00 00 00 <41> f7 41 20 a0 02 00 00 0f 85 98 43 00 00 44 8b 74 24 10 41 bb ff
Jul  2 16:29:42 ilu-server kernel: [2265746.751416] RSP: 002b:00007ffc728405e0 EFLAGS: 00000286
Jul  2 16:29:42 ilu-server kernel: [2265746.751416] RAX: 0000000000000056 RBX: 00007f8004abf6c8 RCX: 00007f8007e142e8
Jul  2 16:29:42 ilu-server kernel: [2265746.751417] RDX: 000056176508a370 RSI: 00007f8007e151e0 RDI: ffffffffffffffff
Jul  2 16:29:42 ilu-server kernel: [2265746.751417] RBP: 00007f8007e104b0 R08: 0000000000000000 R09: 00007f8007e151e0
Jul  2 16:29:42 ilu-server kernel: [2265746.751417] R10: 00007f8004abf840 R11: 00007f8007e10490 R12: 000056176508a370
Jul  2 16:29:42 ilu-server kernel: [2265746.751417] R13: 0000000000000000 R14: 00007f806d8faf10 R15: 00007f8004abf840
Jul  2 16:29:42 ilu-server kernel: [2265746.751417] FS:  00007f80a202a740 GS:  0000000000000000
Jul  2 16:29:42 ilu-server kernel: [2265746.751509] NMI backtrace for cpu 1 skipped: idling at intel_idle+0x87/0x130
Jul  2 16:29:42 ilu-server kernel: [2265746.751610] NMI backtrace for cpu 2
Jul  2 16:29:42 ilu-server kernel: [2265746.751611] CPU: 2 PID: 26142 Comm: sumo Tainted: P           OE     5.3.0-7629-generic #31~1581628854~18.04~2db8a7a-Ubuntu
Jul  2 16:29:42 ilu-server kernel: [2265746.751611] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X299 Extreme4, BIOS P1.50 10/22/2018
Jul  2 16:29:42 ilu-server kernel: [2265746.751611] RIP: 0033:0x555a3dcbb7bb
Jul  2 16:29:42 ilu-server kernel: [2265746.751612] Code: 8b 8f 38 01 00 00 4d 8b 01 4d 3b 41 08 74 41 49 8b 40 30 49 8b 48 38 66 0f ef c0 48 83 c0 08 48 39 c8 74 17 0f 1f 00 48 8b 10 <48> 83 c0 08 48 39 c8 f2 0f 58 82 c0 01 00 00 75 ec 49 39 30 f2 41
Jul  2 16:29:42 ilu-server kernel: [2265746.751612] RSP: 002b:00007ffd5f2cc7f8 EFLAGS: 00000287
Jul  2 16:29:42 ilu-server kernel: [2265746.751612] RAX: 0000555a4e028538 RBX: 0000000000000000 RCX: 0000555a4e028548
Jul  2 16:29:42 ilu-server kernel: [2265746.751613] RDX: 0000555a3fd150f0 RSI: 0000555a3fd15e20 RDI: 0000555a52ccd230
Jul  2 16:29:42 ilu-server kernel: [2265746.751613] RBP: 0000555a3fd15a20 R08: 0000555a51eefa50 R09: 0000555a52cccfe0
Jul  2 16:29:42 ilu-server kernel: [2265746.751613] R10: fffffffffffff000 R11: 0000555a54da6000 R12: 0000000000000000
Jul  2 16:29:42 ilu-server kernel: [2265746.751613] R13: 0000555a52cccfe0 R14: 0000000000000000 R15: 000000006452d0c8
Jul  2 16:29:42 ilu-server kernel: [2265746.751613] FS:  00007f402217d780 GS:  0000000000000000
Jul  2 16:29:42 ilu-server kernel: [2265746.751709] NMI backtrace for cpu 4 skipped: idling at intel_idle+0x87/0x130
Jul  2 16:29:42 ilu-server kernel: [2265746.751814] NMI backtrace for cpu 5 skipped: idling at intel_idle+0x87/0x130
Jul  2 16:29:42 ilu-server kernel: [2265746.751912] NMI backtrace for cpu 6 skipped: idling at intel_idle+0x87/0x130
Jul  2 16:29:42 ilu-server kernel: [2265746.752012] NMI backtrace for cpu 7 skipped: idling at intel_idle+0x87/0x130
Jul  2 16:29:42 ilu-server kernel: [2265746.752109] NMI backtrace for cpu 8 skipped: idling at intel_idle+0x87/0x130
Jul  2 16:29:42 ilu-server kernel: [2265746.752251] NMI backtrace for cpu 9 skipped: idling at intel_idle+0x87/0x130
Jul  2 16:29:42 ilu-server kernel: [2265746.752304] NMI backtrace for cpu 10 skipped: idling at intel_idle+0x87/0x130
Jul  2 16:29:42 ilu-server kernel: [2265746.752401] NMI backtrace for cpu 11 skipped: idling at intel_idle+0x87/0x130
Jul  2 16:29:42 ilu-server kernel: [2265746.752504] NMI backtrace for cpu 12 skipped: idling at intel_idle+0x87/0x130
Jul  2 16:29:42 ilu-server kernel: [2265746.752606] NMI backtrace for cpu 13 skipped: idling at intel_idle+0x87/0x130
Jul  2 16:29:42 ilu-server kernel: [2265746.752704] NMI backtrace for cpu 14 skipped: idling at intel_idle+0x87/0x130
Jul  2 16:29:42 ilu-server kernel: [2265746.752804] NMI backtrace for cpu 15 skipped: idling at intel_idle+0x87/0x130
Jul  2 16:29:42 ilu-server kernel: [2265746.752905] NMI backtrace for cpu 16 skipped: idling at intel_idle+0x87/0x130
Jul  2 16:29:42 ilu-server kernel: [2265746.753001] NMI backtrace for cpu 17 skipped: idling at intel_idle+0x87/0x130
Jul  2 16:29:42 ilu-server kernel: [2265746.753106] NMI backtrace for cpu 18 skipped: idling at intel_idle+0x87/0x130
Jul  2 16:29:42 ilu-server kernel: [2265746.753208] NMI backtrace for cpu 19 skipped: idling at intel_idle+0x87/0x130

We noticed that it sometimes happens when we have multiple jobs on the GPU or when someone kills or interrupts a process. (However, we don't think this is the only cause.)

We also noticed that the GPU sometimes kills a process, throws an Xid 32, and keeps working normally.

# /var/log/kern.log
May 17 22:32:49 ilu-server kernel: [11655.586024] NVRM: GPU at PCI:0000:65:00: GPU-510f78af-ece2-3a18-237b-f3376ca6d914
May 17 22:32:49 ilu-server kernel: [11655.586028] NVRM: GPU Board Serial Number: 0325018137217
May 17 22:32:49 ilu-server kernel: [11655.586030] NVRM: Xid (PCI:0000:65:00): 32, pid=359, Channel ID 00000012 intr0 00040000
May 17 22:32:49 ilu-server kernel: [11655.586328] NVRM: Xid (PCI:0000:65:00): 32, pid=13414, Channel ID 00000012 intr0 00040000
# my process
Track generation: 1188..1489 -> 301-tiles track
2020-05-17 22:32:49.936130: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:274] Unexpected Event status: 1
[ilu-server:13414] *** Process received signal ***
[ilu-server:13414] Signal: Aborted (6)
[ilu-server:13414] Signal code: (-6)
[ilu-server:13414] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7fa55f25b890]
[ilu-server:13414] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7fa55ee96e97]
[ilu-server:13414] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7fa55ee98801]
[ilu-server:13414] [ 3] /home/besteves/miniconda3/envs/sb/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0x76cc9f7)[0x7fa520b419f7]
[ilu-server:13414] [ 4] /home/besteves/miniconda3/envs/sb/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8EventMgr10PollEventsEbPN4absl13InlinedVectorINS0_5InUseELm4ESaIS3_EEE+0xf4)[0x7fa520df7764]
[ilu-server:13414] [ 5] /home/besteves/miniconda3/envs/sb/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8EventMgr8PollLoopEv+0xce)[0x7fa520df7f6e]
[ilu-server:13414] [ 6] /home/besteves/miniconda3/envs/sb/lib/python3.7/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x2e2)[0x7fa518aac522]
[ilu-server:13414] [ 7] /home/besteves/miniconda3/envs/sb/lib/python3.7/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7fa518aa9738]
[ilu-server:13414] [ 8] /home/besteves/miniconda3/envs/sb/bin/../lib/libstdc++.so.6(+0xc8421)[0x7fa50b339421]
[ilu-server:13414] [ 9] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7fa55f2506db]
[ilu-server:13414] [10] /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7fa55ef7988f]
[ilu-server:13414] *** End of error message ***
Aborted (core dumped)
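
For anyone who wants to tally how often each Xid shows up and which PIDs are involved, here is a minimal sketch. It is only an illustration: the regex covers just the two NVRM line shapes shown in this post (other Xid formats may not match), and the names `XID_RE`, `parse_xid_events` and `count_by_xid` are made up for this example.

```python
import re
from collections import Counter

# Assumption: matches only lines shaped like the NVRM Xid entries quoted
# in this post, e.g. "NVRM: Xid (PCI:0000:65:00): 62, pid=32009, ...".
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<xid>\d+), pid=(?P<pid>\d+)"
)


def parse_xid_events(lines):
    """Return a list of (xid, pid, bus) tuples, one per matching log line."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append(
                (int(match.group("xid")), int(match.group("pid")), match.group("bus"))
            )
    return events


def count_by_xid(events):
    """Count how many events were logged for each Xid number."""
    return Counter(xid for xid, _pid, _bus in events)
```

It could be run over the whole log with something like `count_by_xid(parse_xid_events(open("/var/log/kern.log", errors="replace")))`, and the resulting PIDs compared against the jobs that were running at the time.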

We tried reinstalling CUDA and the driver, and also using the official CUDA (instead of the conda version) for the ML workloads, without effect. We also ruled out overheating as the problem.

I also ran gpu-burn and cuda_memtest and found no issues.

Might this be a hardware problem, as suggested in other posts about this Xid?

We have been having this problem for some time now, and I thought this forum could be helpful and give some insight.

Any help/advice is deeply appreciated!

nvidia-bug-report_2_july.log.gz.txt (391.0 KB)
nvidia-bug-report_10_may.log.gz.txt (754.4 KB)
nvidia-bug-report_30_april.log.gz.txt (747.0 KB)