Thank you, @eugene.debeste!
The provider’s system is indeed virtualized (with QEMU) and has PCIe H100s. They are now attempting to reproduce the issue themselves.
And just for the record, here are the types of errors that we’ve been dealing with (NVIDIA driver 535.183.01).
Xid errors 94, 140, 31, 43, 95, and 63 occurred, in descending order of frequency:
NVRM: Xid (PCI:0000:00:09): 94, pid='<unknown>', name=<unknown>, Contained: SM (0x1). RST: No, D-RST: No
NVRM: Xid (PCI:0000:00:09): 94, pid=3550278, name=pt_main_thread, Ch 00000008
NVRM: Xid (PCI:0000:00:09): 94, pid=3550278, name=pt_main_thread, Ch 00000009
NVRM: Xid (PCI:0000:00:09): 94, pid=3550278, name=pt_main_thread, Ch 0000000a
NVRM: Xid (PCI:0000:00:09): 94, pid=3550278, name=pt_main_thread, Ch 0000000b
NVRM: Xid (PCI:0000:00:09): 94, pid=3550278, name=pt_main_thread, Ch 0000000c
NVRM: Xid (PCI:0000:00:09): 94, pid=3550278, name=pt_main_thread, Ch 0000000d
NVRM: Xid (PCI:0000:00:09): 94, pid=3550278, name=pt_main_thread, Ch 0000000e
NVRM: Xid (PCI:0000:00:09): 94, pid=3550278, name=pt_main_thread, Ch 0000000f
NVRM: Xid (PCI:0000:00:07): 43, pid=1441160, name=pt_main_thread, Ch 00000008
NVRM: Xid (PCI:0000:00:07): 43, pid=1443554, name=pt_main_thread, Ch 00000008
NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x62:0xb:2404)
NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 2
NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x62:0xb:2404)
NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 2
NVRM: Xid (PCI:0000:00:09): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: Xid (PCI:0000:00:09): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: Xid (PCI:0000:00:09): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: Xid (PCI:0000:00:09): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: GPU 0000:00:09.0: RmInitAdapter failed! (0x62:0xb:2404)
NVRM: GPU 0000:00:09.0: rm_init_adapter failed, device minor number 3
NVRM: Xid (PCI:0000:00:09): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: Xid (PCI:0000:00:09): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: Xid (PCI:0000:00:09): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: Xid (PCI:0000:00:09): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: GPU 0000:00:09.0: RmInitAdapter failed! (0x62:0xb:2404)
NVRM: GPU 0000:00:09.0: rm_init_adapter failed, device minor number 3
NVRM: GPU 0000:00:09.0: RmInitAdapter failed! (0x62:0x40:2404)
NVRM: GPU 0000:00:09.0: rm_init_adapter failed, device minor number 3
NVRM: GPU 0000:00:09.0: RmInitAdapter failed! (0x62:0x40:2404)
NVRM: GPU 0000:00:09.0: rm_init_adapter failed, device minor number 3
NVRM: GPU 0000:00:09.0: RmInitAdapter failed! (0x62:0x40:2404)
NVRM: GPU 0000:00:09.0: rm_init_adapter failed, device minor number 3
NVRM: GPU 0000:00:09.0: RmInitAdapter failed! (0x62:0x40:2404)
NVRM: GPU 0000:00:09.0: rm_init_adapter failed, device minor number 3
NVRM: GPU 0000:00:09.0: RmInitAdapter failed! (0x62:0x40:2404)
NVRM: GPU 0000:00:09.0: rm_init_adapter failed, device minor number 3
NVRM: GPU 0000:00:09.0: RmInitAdapter failed! (0x62:0x40:2404)
NVRM: GPU 0000:00:09.0: rm_init_adapter failed, device minor number 3
NVRM: GPU at PCI:0000:00:08: GPU-835ba0d7-7218-84be-af2e-2beba13420e8
NVRM: GPU Board Serial Number: 1650623020125
NVRM: Xid (PCI:0000:00:08): 31, pid=2908924, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7d19_12a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: GPU at PCI:0000:00:0c: GPU-999bc7c9-951c-cb6c-d72f-ab611abdc2fc
NVRM: GPU Board Serial Number: 1650623020082
NVRM: Xid (PCI:0000:00:0c): 31, pid=2908928, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7d19_12a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: GPU at PCI:0000:00:0b: GPU-a2fba816-ad60-752b-aedd-376ac341745f
NVRM: GPU Board Serial Number: 1650623020058
NVRM: Xid (PCI:0000:00:0b): 31, pid=2908927, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7d19_12a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: GPU at PCI:0000:00:0d: GPU-d8103984-1a5b-388b-d045-02b765eca3cd
NVRM: GPU Board Serial Number: 1650623020016
NVRM: Xid (PCI:0000:00:0d): 31, pid=2908929, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7d19_16a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: GPU at PCI:0000:00:07: GPU-2c390139-d85f-7eee-365e-8a00244025ad
NVRM: GPU Board Serial Number: 1650623011793
NVRM: Xid (PCI:0000:00:07): 31, pid=2908923, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7d19_12a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: GPU at PCI:0000:00:0a: GPU-69a138b9-be4e-603f-ed74-6d10844329f5
NVRM: GPU Board Serial Number: 1650223015186
NVRM: Xid (PCI:0000:00:0a): 31, pid=2908926, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7d19_12a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: GPU at PCI:0000:00:09: GPU-d547b3e3-89c4-0c85-b81d-b6b15e62e10a
NVRM: GPU Board Serial Number: 1650723017029
NVRM: Xid (PCI:0000:00:09): 31, pid=2908925, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7d19_12a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: Xid (PCI:0000:00:0d): 31, pid=3404743, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x77b6_72a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: Xid (PCI:0000:00:0c): 31, pid=3404742, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_4 faulted @ 0x77b6_6ea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: Xid (PCI:0000:00:08): 31, pid=3404738, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x77b6_6ea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: Xid (PCI:0000:00:09): 31, pid=3404739, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x77b6_6ea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: Xid (PCI:0000:00:07): 31, pid=3404737, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x77b6_6ea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: Xid (PCI:0000:00:0b): 31, pid=3404741, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x77b6_6ea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: Xid (PCI:0000:00:0a): 31, pid=3404740, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x77b6_6ea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: Xid (PCI:0000:00:0b): 31, pid=3684110, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x775f_96a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: Xid (PCI:0000:00:0a): 31, pid=3684109, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x775f_96a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: Xid (PCI:0000:00:07): 31, pid=3684106, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x775f_96a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: Xid (PCI:0000:00:08): 31, pid=3684107, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x775f_96a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: Xid (PCI:0000:00:0d): 31, pid=3684112, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x775f_92a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: Xid (PCI:0000:00:0c): 31, pid=3684111, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x775f_96a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: Xid (PCI:0000:00:09): 31, pid=3684108, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x775f_96a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 11 04:48:16 node3 kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 535.183.01 Sun May 12 19:39:15 UTC 2024
Aug 13 14:12:36 node3 kernel: NVRM: GPU at PCI:0000:00:09: GPU-f9d74506-a27c-4168-bb62-0910f23e9a31
Aug 13 14:12:36 node3 kernel: NVRM: GPU Board Serial Number: 1650623011938
Aug 13 14:12:36 node3 kernel: NVRM: Xid (PCI:0000:00:09): 31, pid=2110660, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_4 faulted @ 0x733f_76a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 13 14:12:36 node3 kernel: NVRM: GPU at PCI:0000:00:07: GPU-1e778966-789f-658a-726c-34e5253f7b31
Aug 13 14:12:36 node3 kernel: NVRM: GPU Board Serial Number: 1650723017460
Aug 13 14:12:36 node3 kernel: NVRM: Xid (PCI:0000:00:07): 31, pid=2110658, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_4 faulted @ 0x733f_76a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 13 14:12:36 node3 kernel: NVRM: GPU at PCI:0000:00:0c: GPU-6bd76402-49c7-14b0-cf4e-9706661f2b14
Aug 13 14:12:36 node3 kernel: NVRM: GPU Board Serial Number: 1650723017058
Aug 13 14:12:36 node3 kernel: NVRM: Xid (PCI:0000:00:0c): 31, pid=2110663, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_4 faulted @ 0x733f_76a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 13 14:12:36 node3 kernel: NVRM: GPU at PCI:0000:00:08: GPU-c8544924-fd73-e0ed-2644-d83eb7dd7658
Aug 13 14:12:36 node3 kernel: NVRM: GPU Board Serial Number: 1650723016962
Aug 13 14:12:36 node3 kernel: NVRM: Xid (PCI:0000:00:08): 31, pid=2110659, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_4 faulted @ 0x733f_76a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 13 14:12:36 node3 kernel: NVRM: GPU at PCI:0000:00:0b: GPU-6e8751ce-fecf-53b6-7682-10facc66681b
Aug 13 14:12:36 node3 kernel: NVRM: GPU Board Serial Number: 1650723017519
Aug 13 14:12:36 node3 kernel: NVRM: Xid (PCI:0000:00:0b): 31, pid=2110662, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_2 faulted @ 0x733f_76a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 13 14:12:36 node3 kernel: NVRM: GPU at PCI:0000:00:0a: GPU-23981421-5eb6-13b9-312b-8e01bbbcec23
Aug 13 14:12:36 node3 kernel: NVRM: GPU Board Serial Number: 1650723017408
Aug 13 14:12:36 node3 kernel: NVRM: Xid (PCI:0000:00:0a): 31, pid=2110661, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_2 faulted @ 0x733f_76a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 13 14:12:36 node3 kernel: NVRM: GPU at PCI:0000:00:0d: GPU-000e1b97-b118-337c-71a2-e67b64f05220
Aug 13 14:12:36 node3 kernel: NVRM: GPU Board Serial Number: 1650723016956
Aug 13 14:12:36 node3 kernel: NVRM: Xid (PCI:0000:00:0d): 31, pid=2110664, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_4 faulted @ 0x733f_76a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 18 19:42:40 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid='<unknown>', name=<unknown>, Contained: SM (0x1). RST: No, D-RST: No
Aug 18 19:42:40 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=875206, name=pt_main_thread, Ch 00000008
Aug 18 19:42:40 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=875206, name=pt_main_thread, Ch 00000009
Aug 18 19:42:40 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=875206, name=pt_main_thread, Ch 0000000a
Aug 18 19:42:40 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=875206, name=pt_main_thread, Ch 0000000b
Aug 18 19:42:40 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=875206, name=pt_main_thread, Ch 0000000c
Aug 18 19:42:40 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=875206, name=pt_main_thread, Ch 0000000d
Aug 18 19:42:40 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=875206, name=pt_main_thread, Ch 0000000e
Aug 18 19:42:40 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=875206, name=pt_main_thread, Ch 0000000f
Aug 19 18:38:13 node3 kernel: NVRM: GPU at PCI:0000:00:09: GPU-f9d74506-a27c-4168-bb62-0910f23e9a31
Aug 19 18:38:13 node3 kernel: NVRM: GPU Board Serial Number: 1650623011938
Aug 19 18:38:13 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid='<unknown>', name=<unknown>, Contained: SM (0x1). RST: No, D-RST: No
Aug 19 18:38:13 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=2189357, name=pt_main_thread, Ch 00000008
Aug 19 18:38:13 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=2189357, name=pt_main_thread, Ch 00000009
Aug 19 18:38:13 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=2189357, name=pt_main_thread, Ch 0000000a
Aug 19 18:38:13 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=2189357, name=pt_main_thread, Ch 0000000b
Aug 19 18:38:13 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=2189357, name=pt_main_thread, Ch 0000000c
Aug 19 18:38:13 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=2189357, name=pt_main_thread, Ch 0000000d
Aug 19 18:38:13 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=2189357, name=pt_main_thread, Ch 0000000e
Aug 19 18:38:13 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=2189357, name=pt_main_thread, Ch 0000000f
Jul 27 20:33:31 node4 kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 535.183.01 Sun May 12 19:39:15 UTC 2024
Jul 28 08:50:35 node4 kernel: NVRM: GPU at PCI:0000:00:07: GPU-3b6ec030-5adc-1847-e155-79f635584b4e
Jul 28 08:50:35 node4 kernel: NVRM: GPU Board Serial Number: 1652923017935
Jul 28 08:50:35 node4 kernel: NVRM: Xid (PCI:0000:00:07): 94, pid='<unknown>', name=<unknown>, Contained: CE User Channel (0xb). RST: No, D-RST: No
Jul 28 08:50:35 node4 kernel: NVRM: Xid (PCI:0000:00:07): 94, pid=941956, name=python, Ch 00000008
Jul 28 08:50:35 node4 kernel: NVRM: Xid (PCI:0000:00:07): 94, pid=941956, name=python, Ch 00000009
Jul 28 08:50:35 node4 kernel: NVRM: Xid (PCI:0000:00:07): 94, pid=941956, name=python, Ch 0000000a
Jul 28 08:50:35 node4 kernel: NVRM: Xid (PCI:0000:00:07): 94, pid=941956, name=python, Ch 0000000b
Jul 28 08:50:35 node4 kernel: NVRM: Xid (PCI:0000:00:07): 94, pid=941956, name=python, Ch 0000000c
Jul 28 08:50:35 node4 kernel: NVRM: Xid (PCI:0000:00:07): 94, pid=941956, name=python, Ch 0000000d
Jul 28 08:50:35 node4 kernel: NVRM: Xid (PCI:0000:00:07): 94, pid=941956, name=python, Ch 0000000e
Jul 28 08:50:35 node4 kernel: NVRM: Xid (PCI:0000:00:07): 94, pid=941956, name=python, Ch 0000000f
Jul 28 08:51:33 node4 kernel: NVRM: GPU at PCI:0000:00:08: GPU-c24840ec-8de1-83d5-b126-08000173ae32
Jul 28 08:51:33 node4 kernel: NVRM: GPU Board Serial Number: 1652923018111
Jul 28 08:51:33 node4 kernel: NVRM: Xid (PCI:0000:00:08): 95, pid='<unknown>', name=<unknown>, Uncontained: FBHUB. RST: Yes, D-RST: No
Jul 28 09:15:32 node4 kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 535.183.01 Sun May 12 19:39:15 UTC 2024
Jul 31 23:14:17 node4 kernel: NVRM: GPU at PCI:0000:00:05: GPU-efbdfde9-5798-a6e7-4c46-12518fa15375
Jul 31 23:14:17 node4 kernel: NVRM: GPU Board Serial Number: 1650423013443
Jul 31 23:14:17 node4 kernel: NVRM: Xid (PCI:0000:00:05): 43, pid=1045242, name=pt_main_thread, Ch 00000008
Jul 31 23:15:25 node4 kernel: NVRM: Xid (PCI:0000:00:05): 43, pid=1084984, name=pt_main_thread, Ch 00000008
Aug 05 03:54:11 node4 kernel: NVRM: GPU at PCI:0000:00:07: GPU-3b6ec030-5adc-1847-e155-79f635584b4e
Aug 05 03:54:11 node4 kernel: NVRM: GPU Board Serial Number: 1652923017935
Aug 05 03:54:11 node4 kernel: NVRM: Xid (PCI:0000:00:07): 31, pid=2512125, name=python3.10, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_8 faulted @ 0x73ac_1c000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_WRITE
Aug 09 13:32:51 node4 kernel: NVRM: GPU at PCI:0000:00:08: GPU-c24840ec-8de1-83d5-b126-08000173ae32
Aug 09 13:32:51 node4 kernel: NVRM: GPU Board Serial Number: 1652923018111
Aug 09 13:32:51 node4 kernel: NVRM: Xid (PCI:0000:00:08): 94, pid='<unknown>', name=<unknown>, Contained: SM (0x1). RST: No, D-RST: No
Aug 09 13:32:51 node4 kernel: NVRM: Xid (PCI:0000:00:07): 94, pid='<unknown>', name=<unknown>, Contained: SM (0x1). RST: No, D-RST: No
Aug 09 13:32:51 node4 kernel: NVRM: Xid (PCI:0000:00:08): 94, pid=3780430, name=pt_main_thread, Ch 00000008
Aug 09 13:32:51 node4 kernel: NVRM: Xid (PCI:0000:00:07): 94, pid=3780429, name=pt_main_thread, Ch 00000008
Aug 09 13:32:51 node4 kernel: NVRM: Xid (PCI:0000:00:08): 94, pid=3780430, name=pt_main_thread, Ch 00000009
Aug 09 13:32:51 node4 kernel: NVRM: Xid (PCI:0000:00:07): 94, pid=3780429, name=pt_main_thread, Ch 00000009
...
...
Aug 10 11:04:07 node4 kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 535.183.01 Sun May 12 19:39:15 UTC 2024
Aug 12 22:04:37 node4 kernel: NVRM: GPU at PCI:0000:00:08: GPU-3b6ec030-5adc-1847-e155-79f635584b4e
Aug 12 22:04:37 node4 kernel: NVRM: GPU Board Serial Number: 1652923017935
Aug 12 22:04:37 node4 kernel: NVRM: Xid (PCI:0000:00:08): 31, pid=3253951, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_8 faulted @ 0x7b33_d2a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 12 22:04:37 node4 kernel: NVRM: GPU at PCI:0000:00:09: GPU-c24840ec-8de1-83d5-b126-08000173ae32
Aug 12 22:04:37 node4 kernel: NVRM: GPU Board Serial Number: 1652923018111
Aug 12 22:04:37 node4 kernel: NVRM: Xid (PCI:0000:00:09): 31, pid=3253952, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_8 faulted @ 0x7b33_d2a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 12 22:04:37 node4 kernel: NVRM: GPU at PCI:0000:00:0d: GPU-f790ae43-c5a4-fe11-d524-657843e0c85d
Aug 12 22:04:37 node4 kernel: NVRM: GPU Board Serial Number: 1650623011704
Aug 12 22:04:37 node4 kernel: NVRM: Xid (PCI:0000:00:0d): 31, pid=3253956, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_8 faulted @ 0x7b33_cea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 12 22:04:37 node4 kernel: NVRM: GPU at PCI:0000:00:07: GPU-adbf98be-b5d4-cdff-4807-e5c096c81db8
Aug 12 22:04:37 node4 kernel: NVRM: GPU Board Serial Number: 1652923017890
Aug 12 22:04:37 node4 kernel: NVRM: Xid (PCI:0000:00:07): 31, pid=3253950, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_8 faulted @ 0x7b33_d2a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 12 22:04:37 node4 kernel: NVRM: GPU at PCI:0000:00:0c: GPU-95733769-1b5a-ab5f-ca42-8da0237cf8d7
Aug 12 22:04:37 node4 kernel: NVRM: GPU Board Serial Number: 1652923017827
Aug 12 22:04:37 node4 kernel: NVRM: Xid (PCI:0000:00:0c): 31, pid=3253955, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_8 faulted @ 0x7b33_d2a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 12 22:04:37 node4 kernel: NVRM: GPU at PCI:0000:00:0a: GPU-3747688c-1804-eadc-5bf7-525bf0e97233
Aug 12 22:04:37 node4 kernel: NVRM: GPU Board Serial Number: 1652923017985
Aug 12 22:04:37 node4 kernel: NVRM: Xid (PCI:0000:00:0a): 31, pid=3253953, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_8 faulted @ 0x7b33_d2a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 12 22:04:37 node4 kernel: NVRM: GPU at PCI:0000:00:0b: GPU-c2a5fa97-ed5f-41b7-8afc-3107c9aeabb2
Aug 12 22:04:37 node4 kernel: NVRM: GPU Board Serial Number: 1652923017448
Aug 12 22:04:37 node4 kernel: NVRM: Xid (PCI:0000:00:0b): 31, pid=3253954, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_8 faulted @ 0x7b33_d2a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 18 21:58:34 node4 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid='<unknown>', name=<unknown>, Contained: SM (0x1). RST: No, D-RST: No
Aug 18 21:58:34 node4 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=3226355, name=pt_main_thread, Ch 00000008
Aug 18 21:58:34 node4 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=3226355, name=pt_main_thread, Ch 00000009
Aug 18 21:58:34 node4 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=3226355, name=pt_main_thread, Ch 0000000a
Aug 18 21:58:34 node4 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=3226355, name=pt_main_thread, Ch 0000000b
Aug 18 21:58:34 node4 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=3226355, name=pt_main_thread, Ch 0000000c
Aug 18 21:58:34 node4 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=3226355, name=pt_main_thread, Ch 0000000d
Aug 18 21:58:34 node4 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=3226355, name=pt_main_thread, Ch 0000000e
Aug 18 21:58:34 node4 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=3226355, name=pt_main_thread, Ch 0000000f
Aug 18 22:11:01 node4 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid='<unknown>', name=<unknown>, Contained: SM (0x1). RST: No, D-RST: No
Aug 18 22:11:01 node4 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=3282927, name=pt_main_thread, Ch 00000008
Aug 18 22:11:01 node4 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=3282927, name=pt_main_thread, Ch 00000009
...
Jul 27 21:03:39 node6 kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 535.183.01 Sun May 12 19:39:15 UTC 2024
Jul 28 12:57:48 node6 kernel: NVRM: GPU at PCI:0000:00:05: GPU-7033aba4-bd61-b232-aefd-82b60b5bad52
Jul 28 12:57:48 node6 kernel: NVRM: GPU Board Serial Number: 1652923017484
Jul 28 12:57:48 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=58448, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_14 faulted @ 0x76f5_6aa00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Jul 28 15:58:00 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=1143621, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_5 faulted @ 0x7898_3aa00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Jul 29 07:59:23 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=1305565, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x7c86_0ea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Jul 29 15:45:07 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=2283656, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x708b_bea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Jul 29 18:03:48 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=2747406, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_5 faulted @ 0x7266_5aa00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Jul 31 10:16:22 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=2875767, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_3 faulted @ 0x7c41_2aa00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Jul 31 11:50:40 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=981508, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_5 faulted @ 0x7464_02a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 01 04:16:15 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=1062351, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x748c_16a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 01 08:19:20 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=2032503, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_5 faulted @ 0x7534_0aa00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 01 12:18:15 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=2237507, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_3 faulted @ 0x7cbe_cea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 01 16:34:35 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=2466105, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x755e_e6a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 01 20:40:52 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=2698993, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_5 faulted @ 0x7602_c6a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 02 17:09:41 node6 kernel: NVRM: GPU at PCI:0000:00:0c: GPU-0f4c17d8-d2a5-bb1c-610a-48700ce11a3a
Aug 02 17:09:41 node6 kernel: NVRM: GPU Board Serial Number: 1650623011855
Aug 02 17:09:41 node6 kernel: NVRM: Xid (PCI:0000:00:0c): 31, pid=2906358, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x7145_16a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 02 17:09:41 node6 kernel: NVRM: GPU at PCI:0000:00:0b: GPU-9bda9cc7-dc78-2b6f-f206-6ab8ea8dcad5
Aug 02 17:09:41 node6 kernel: NVRM: GPU Board Serial Number: 1652923017933
Aug 02 17:09:41 node6 kernel: NVRM: Xid (PCI:0000:00:0b): 31, pid=2906357, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x7145_16a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 02 17:09:41 node6 kernel: NVRM: GPU at PCI:0000:00:0a: GPU-bd281832-e44f-e4ee-377b-d5807fc3a5eb
Aug 02 17:09:41 node6 kernel: NVRM: GPU Board Serial Number: 1650623011647
Aug 02 17:09:41 node6 kernel: NVRM: Xid (PCI:0000:00:0a): 31, pid=2906356, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_5 faulted @ 0x7145_16a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 02 17:09:41 node6 kernel: NVRM: GPU at PCI:0000:00:07: GPU-286883c5-ad43-eb3e-2259-257aeb552296
Aug 02 17:09:41 node6 kernel: NVRM: GPU Board Serial Number: 1650623011605
Aug 02 17:09:41 node6 kernel: NVRM: Xid (PCI:0000:00:07): 31, pid=2906353, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_5 faulted @ 0x7145_16a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 02 17:09:41 node6 kernel: NVRM: GPU at PCI:0000:00:09: GPU-87f9a74c-f99a-9888-8026-350fb3070740
Aug 02 17:09:41 node6 kernel: NVRM: GPU Board Serial Number: 1650623011670
Aug 02 17:09:41 node6 kernel: NVRM: Xid (PCI:0000:00:09): 31, pid=2906355, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x7145_16a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 02 17:09:41 node6 kernel: NVRM: GPU at PCI:0000:00:06: GPU-d371d7d9-87e9-62fd-dc48-ded1953889ae
Aug 02 17:09:41 node6 kernel: NVRM: GPU Board Serial Number: 1650623011690
Aug 02 17:09:41 node6 kernel: NVRM: Xid (PCI:0000:00:06): 31, pid=2906352, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_5 faulted @ 0x7145_16a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 02 17:09:41 node6 kernel: NVRM: GPU at PCI:0000:00:08: GPU-0f2248ea-9a45-5498-0dd8-17d5288e2779
Aug 02 17:09:41 node6 kernel: NVRM: GPU Board Serial Number: 1652923017387
Aug 02 17:09:41 node6 kernel: NVRM: Xid (PCI:0000:00:08): 31, pid=2906354, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_3 faulted @ 0x7145_16a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 02 19:01:44 node6 kernel: NVRM: Xid (PCI:0000:00:08): 31, pid=4004299, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x7cc4_8aa00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 02 19:01:44 node6 kernel: NVRM: Xid (PCI:0000:00:0a): 31, pid=4004301, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_5 faulted @ 0x7cc4_8aa00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 02 19:01:44 node6 kernel: NVRM: Xid (PCI:0000:00:09): 31, pid=4004300, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x7cc4_8ea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 02 19:01:44 node6 kernel: NVRM: Xid (PCI:0000:00:07): 31, pid=4004298, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_5 faulted @ 0x7cc4_8ea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 02 19:01:44 node6 kernel: NVRM: Xid (PCI:0000:00:06): 31, pid=4004297, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x7cc4_8ea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 02 19:01:44 node6 kernel: NVRM: Xid (PCI:0000:00:0c): 31, pid=4004303, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_5 faulted @ 0x7cc4_8aa00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 02 19:01:44 node6 kernel: NVRM: Xid (PCI:0000:00:0b): 31, pid=4004302, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x7cc4_8ea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 04 11:26:59 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=4116123, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_5 faulted @ 0x7276_16a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 04 23:04:39 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=1982354, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_3 faulted @ 0x70d6_4ea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 04 23:12:58 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=2581605, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_5 faulted @ 0x798b_26a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 05 19:21:06 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=2591483, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x729f_3ea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 07 21:19:17 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=187395, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_10 faulted @ 0x7c8c_3aa00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 00:07:39 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=2271239, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_9 faulted @ 0x7305_92a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 01:32:25 node6 kernel: NVRM: Xid (PCI:0000:00:08): 31, pid=2418069, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_4 faulted @ 0x71b1_56a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 01:32:25 node6 kernel: NVRM: Xid (PCI:0000:00:07): 31, pid=2418068, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x71b1_56a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 01:32:25 node6 kernel: NVRM: Xid (PCI:0000:00:0c): 31, pid=2418075, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_7 faulted @ 0x71b1_56a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 01:32:25 node6 kernel: NVRM: Xid (PCI:0000:00:06): 31, pid=2418067, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x71b1_56a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 01:32:25 node6 kernel: NVRM: Xid (PCI:0000:00:0a): 31, pid=2418073, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_15 faulted @ 0x71b1_56a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 01:32:25 node6 kernel: NVRM: Xid (PCI:0000:00:09): 31, pid=2418070, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_3 faulted @ 0x71b1_56a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 01:32:25 node6 kernel: NVRM: Xid (PCI:0000:00:0b): 31, pid=2418074, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_13 faulted @ 0x71b1_56a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 08:25:55 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=2514252, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_10 faulted @ 0x730d_f6a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 14:37:17 node6 kernel: NVRM: Xid (PCI:0000:00:0b): 31, pid=2836999, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_14 faulted @ 0x7a38_b6a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 14:37:17 node6 kernel: NVRM: Xid (PCI:0000:00:09): 31, pid=2836997, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_15 faulted @ 0x7a38_b6a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 14:37:17 node6 kernel: NVRM: Xid (PCI:0000:00:0c): 31, pid=2837000, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_2 faulted @ 0x7a38_b6a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 14:37:17 node6 kernel: NVRM: Xid (PCI:0000:00:08): 31, pid=2836996, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_13 faulted @ 0x7a38_b6a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 14:37:17 node6 kernel: NVRM: Xid (PCI:0000:00:06): 31, pid=2836994, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_12 faulted @ 0x7a38_b6a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 14:37:17 node6 kernel: NVRM: Xid (PCI:0000:00:07): 31, pid=2836995, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_8 faulted @ 0x7a38_b6a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 14:37:17 node6 kernel: NVRM: Xid (PCI:0000:00:0a): 31, pid=2836998, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_11 faulted @ 0x7a38_b6a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
...
Jul 28 20:21:31 node7 kernel: NVRM: GPU at PCI:0000:00:08: GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2
Jul 28 20:21:31 node7 kernel: NVRM: GPU Board Serial Number: 1650723017142
Jul 28 20:21:31 node7 kernel: NVRM: Xid (PCI:0000:00:08): 95, pid='<unknown>', name=<unknown>, Uncontained: FBHUB. RST: Yes, D-RST: No
Jul 28 20:21:31 node7 kernel: NVRM: Xid (PCI:0000:00:08): 95, pid=1678362, name=python3, Ch 00000008
Jul 28 20:21:32 node7 kernel: NVRM: Xid (PCI:0000:00:08): 95, pid=1678362, name=python3, Ch 00000009
Jul 28 20:21:32 node7 kernel: NVRM: Xid (PCI:0000:00:08): 95, pid=1678362, name=python3, Ch 0000000a
Jul 28 20:21:32 node7 kernel: NVRM: Xid (PCI:0000:00:08): 95, pid=1678362, name=python3, Ch 0000000b
Jul 28 20:21:32 node7 kernel: NVRM: Xid (PCI:0000:00:08): 95, pid=1678362, name=python3, Ch 0000000c
Jul 28 20:21:32 node7 kernel: NVRM: Xid (PCI:0000:00:08): 95, pid=1678362, name=python3, Ch 0000000d
Jul 28 20:21:32 node7 kernel: NVRM: Xid (PCI:0000:00:08): 95, pid=1678362, name=python3, Ch 0000000e
Jul 28 20:21:32 node7 kernel: NVRM: Xid (PCI:0000:00:08): 95, pid=1678362, name=python3, Ch 0000000f
Jul 29 07:54:16 node7 kernel: NVRM: Xid (PCI:0000:00:07): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:17 node7 kernel: NVRM: Xid (PCI:0000:00:07): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:18 node7 kernel: NVRM: Xid (PCI:0000:00:07): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:19 node7 kernel: NVRM: Xid (PCI:0000:00:07): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:19 node7 kernel: NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0xb:2404)
Jul 29 07:54:19 node7 kernel: NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2
Jul 29 07:54:20 node7 kernel: NVRM: Xid (PCI:0000:00:07): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:21 node7 kernel: NVRM: Xid (PCI:0000:00:07): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:22 node7 kernel: NVRM: Xid (PCI:0000:00:07): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:23 node7 kernel: NVRM: Xid (PCI:0000:00:07): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:23 node7 kernel: NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0xb:2404)
Jul 29 07:54:23 node7 kernel: NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2
Jul 29 07:54:24 node7 kernel: NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:25 node7 kernel: NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:26 node7 kernel: NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:27 node7 kernel: NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:27 node7 kernel: NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x62:0xb:2404)
Jul 29 07:54:27 node7 kernel: NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 3
Jul 29 07:54:28 node7 kernel: NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:29 node7 kernel: NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:30 node7 kernel: NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:31 node7 kernel: NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:31 node7 kernel: NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x62:0xb:2404)
Jul 29 07:54:31 node7 kernel: NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 3
...
...
Aug 10 10:12:31 node8 kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 535.183.01 Sun May 12 19:39:15 UTC 2024
Aug 18 19:14:44 node8 kernel: NVRM: GPU at PCI:0000:00:0d: GPU-624baca6-99f9-1923-de58-c1c3e2127948
Aug 18 19:14:44 node8 kernel: NVRM: GPU Board Serial Number: 1650723017118
Aug 18 19:14:44 node8 kernel: NVRM: Xid (PCI:0000:00:0d): 31, pid=321832, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_2 faulted @ 0x728e_82a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 18 19:14:44 node8 kernel: NVRM: GPU at PCI:0000:00:0b: GPU-c08c7e03-968e-a9d4-41df-6bd227312ccc
Aug 18 19:14:44 node8 kernel: NVRM: GPU Board Serial Number: 1652923017329
Aug 18 19:14:44 node8 kernel: NVRM: Xid (PCI:0000:00:0b): 31, pid=321830, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_2 faulted @ 0x728e_86a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 18 19:14:44 node8 kernel: NVRM: GPU at PCI:0000:00:0c: GPU-7b0ead78-b96a-ac40-c015-6e57d219eb1b
Aug 18 19:14:44 node8 kernel: NVRM: GPU Board Serial Number: 1650623011835
Aug 18 19:14:44 node8 kernel: NVRM: Xid (PCI:0000:00:0c): 31, pid=321831, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_2 faulted @ 0x728e_86a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 18 19:14:44 node8 kernel: NVRM: GPU at PCI:0000:00:07: GPU-264bcba1-c564-3380-77da-2c0a01a37e90
Aug 18 19:14:44 node8 kernel: NVRM: GPU Board Serial Number: 1652923018062
Aug 18 19:14:44 node8 kernel: NVRM: Xid (PCI:0000:00:07): 31, pid=321826, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_2 faulted @ 0x728e_86a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 18 19:14:44 node8 kernel: NVRM: GPU at PCI:0000:00:08: GPU-74e4ee32-858d-0f8d-f2ad-ba474a3d4819
Aug 18 19:14:44 node8 kernel: NVRM: GPU Board Serial Number: 1652923017434
Aug 18 19:14:44 node8 kernel: NVRM: Xid (PCI:0000:00:08): 31, pid=321827, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_2 faulted @ 0x728e_86a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 18 19:14:44 node8 kernel: NVRM: GPU at PCI:0000:00:0a: GPU-b5d19501-13d9-ca5d-6b6a-8139f44d9bbd
Aug 18 19:14:44 node8 kernel: NVRM: GPU Board Serial Number: 1652923017315
Aug 18 19:14:44 node8 kernel: NVRM: Xid (PCI:0000:00:0a): 31, pid=321829, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_2 faulted @ 0x728e_86a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 18 19:14:44 node8 kernel: NVRM: GPU at PCI:0000:00:09: GPU-fd5efc82-dde2-5451-7959-99dc0d46a6b3
Aug 18 19:14:44 node8 kernel: NVRM: GPU Board Serial Number: 1652923017698
Aug 18 19:14:44 node8 kernel: NVRM: Xid (PCI:0000:00:09): 31, pid=321828, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_2 faulted @ 0x728e_86a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 18 20:45:34 node8 kernel: NVRM: Xid (PCI:0000:00:08): 94, pid='<unknown>', name=<unknown>, Contained: SM (0x1). RST: No, D-RST: No
Aug 18 20:45:34 node8 kernel: NVRM: Xid (PCI:0000:00:08): 94, pid=719935, name=pt_main_thread, Ch 00000008
Aug 18 20:45:34 node8 kernel: NVRM: Xid (PCI:0000:00:08): 94, pid=719935, name=pt_main_thread, Ch 00000009
Aug 18 20:45:34 node8 kernel: NVRM: Xid (PCI:0000:00:08): 94, pid=719935, name=pt_main_thread, Ch 0000000a
Aug 18 20:45:34 node8 kernel: NVRM: Xid (PCI:0000:00:08): 94, pid=719935, name=pt_main_thread, Ch 0000000b
Aug 18 20:45:34 node8 kernel: NVRM: Xid (PCI:0000:00:08): 94, pid=719935, name=pt_main_thread, Ch 0000000c
Aug 18 20:45:34 node8 kernel: NVRM: Xid (PCI:0000:00:08): 94, pid=719935, name=pt_main_thread, Ch 0000000d
Aug 18 20:45:34 node8 kernel: NVRM: Xid (PCI:0000:00:08): 94, pid=719935, name=pt_main_thread, Ch 0000000e
Aug 18 20:45:34 node8 kernel: NVRM: Xid (PCI:0000:00:08): 94, pid=719935, name=pt_main_thread, Ch 0000000f
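
For anyone who wants to tally excerpts like the ones above, here is a minimal sketch (not something from the provider or NVIDIA) that counts Xid lines per code and per GPU. The function name `tally_xids` and the regular expression are my own assumptions, based only on the line format visible in these logs, so other driver versions may need a different pattern.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpts in this post:
#   "<timestamp> <host> kernel: NVRM: Xid (PCI:<addr>): <code>, ..."
# The "<host> kernel: " prefix is optional so bare dmesg lines also match.
XID_RE = re.compile(
    r"(?:(\S+) kernel: )?NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),"
)

def tally_xids(lines):
    """Count Xid log lines per code and per (host, PCI address, code)."""
    by_code = Counter()
    by_gpu = Counter()
    for line in lines:
        m = XID_RE.search(line)
        if m is None:
            continue  # not an Xid line (e.g. RmInitAdapter failures)
        host, pci, code = m.group(1), m.group(2), int(m.group(3))
        by_code[code] += 1
        by_gpu[(host, pci, code)] += 1
    return by_code, by_gpu

if __name__ == "__main__":
    import sys
    # Example: journalctl -k | python3 tally_xids.py
    by_code, by_gpu = tally_xids(sys.stdin)
    for code, n in by_code.most_common():
        print(f"Xid {code}: {n} lines")
    for (host, pci, code), n in by_gpu.most_common():
        print(f"{host} {pci} Xid {code}: {n} lines")
```

Note that this counts log lines, not incidents: a single Xid 94 or 95 event above prints one line per channel (Ch 00000008 through 0000000f), so those codes are over-represented unless lines with the same timestamp and GPU are grouped.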