I have an Ubuntu 16.04.7 system (Linux 4.4.0-210-generic) with 8 RTX A6000 GPUs (SLI NVLink). While I was running some Python programs on the GPU, the programs suddenly stopped, and nvidia-smi shows a GPU Fan error on one of the two GPUs. I will attach a screenshot and the nvidia-bug-report.
I really need some help. Could someone help me figure this out?
dmesg | grep NVRM
[ 735.795596] nvidia-uvm: Loaded the UVM driver, major device number 240.
[ 982.069261] NVRM: GPU at PCI:0000:e1:00: GPU-a9c5e1f6-cf9c-b370-c693-e6b57e6db987
[ 982.069359] NVRM: GPU Board Serial Number: 1322021047991
[ 982.069364] NVRM: Xid (PCI:0000:e1:00): 62, pid=3972, 0000(0000) 00000000 00000000
[ 982.093605] NVRM: Xid (PCI:0000:e1:00): 45, pid=4821, Ch 00000008
[ 983.093374] NVRM: Xid (PCI:0000:e1:00): 45, pid=4821, Ch 00000009
[ 983.093840] NVRM: Xid (PCI:0000:e1:00): 45, pid=4821, Ch 0000000a
[ 983.094263] NVRM: Xid (PCI:0000:e1:00): 45, pid=4821, Ch 0000000b
[ 983.094693] NVRM: Xid (PCI:0000:e1:00): 45, pid=4821, Ch 0000000c
[ 983.095109] NVRM: Xid (PCI:0000:e1:00): 45, pid=4821, Ch 0000000d
[ 983.095526] NVRM: Xid (PCI:0000:e1:00): 45, pid=4821, Ch 0000000e
[ 983.095942] NVRM: Xid (PCI:0000:e1:00): 45, pid=4821, Ch 0000000f
[ 983.592862] NVRM: Xid (PCI:0000:e1:00): 31, pid=4821, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x11_f4a9a000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
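For anyone triaging a similar log, here is a minimal Python sketch that counts the Xid events per GPU and code. The names `XID_DESCRIPTIONS`, `summarize_xids`, and `SAMPLE_LOG` are made up for this illustration, the descriptions are paraphrased from NVIDIA's Xid documentation for only the three codes seen above, and the regular expression assumes the line layout shown in this dmesg output.

```python
import re
from collections import Counter

# Short descriptions paraphrased from NVIDIA's Xid documentation.
# Only the three codes that appear in the log above are listed.
XID_DESCRIPTIONS = {
    31: "GPU memory page fault",
    45: "Preemptive cleanup, due to previous errors",
    62: "Internal micro-controller halt",
}

# Assumes lines shaped like:
# [ 982.069364] NVRM: Xid (PCI:0000:e1:00): 62, pid=3972, ...
XID_LINE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+),")

# A few lines copied from the dmesg output above.
SAMPLE_LOG = """\
[ 982.069364] NVRM: Xid (PCI:0000:e1:00): 62, pid=3972, 0000(0000) 00000000 00000000
[ 982.093605] NVRM: Xid (PCI:0000:e1:00): 45, pid=4821, Ch 00000008
[ 983.093374] NVRM: Xid (PCI:0000:e1:00): 45, pid=4821, Ch 00000009
[ 983.592862] NVRM: Xid (PCI:0000:e1:00): 31, pid=4821, Ch 00000008, intr 00000000.
"""


def summarize_xids(dmesg_text):
    """Count Xid events per (PCI address, Xid code) in dmesg text."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = XID_LINE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


if __name__ == "__main__":
    for (address, code), count in sorted(summarize_xids(SAMPLE_LOG).items()):
        description = XID_DESCRIPTIONS.get(code, "see NVIDIA Xid documentation")
        print(f"{address}  Xid {code:>3}  x{count}  {description}")
```

In this log the first event is Xid 62, and the Xid 45 and Xid 31 entries follow it on the same GPU, so the earliest code is probably the one to look up first.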
rs277
March 11, 2022, 6:20pm
3
Xid codes are here: XID Errors :: GPU Deployment and Management Documentation
It’s possibly a hardware fault. Have you ascertained that the fan is actually operating?
A full cold restart, with power removed, may be worth a try.
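To check the fan reading from software as well as by eye, a small Python sketch along these lines could be used. It assumes `nvidia-smi` is on the PATH and that the GPU names contain no commas; the helper names `parse_fan_report` and `read_fan_report` are made up for this example. Any fan value that is not a plain number is reported as `None`, since the exact error text nvidia-smi prints for a failed sensor is not assumed here.

```python
import subprocess

# Query fields supported by nvidia-smi's --query-gpu option.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,fan.speed,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def to_int(value):
    """Return value as an int, or None if it is not a plain number."""
    try:
        return int(value)
    except ValueError:
        return None


def parse_fan_report(csv_text):
    """Parse nvidia-smi CSV rows into (index, name, fan_percent, temp_c)."""
    rows = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 4:
            continue  # skip lines that do not match the expected layout
        index, name, fan, temp = fields
        rows.append((int(index), name, to_int(fan), to_int(temp)))
    return rows


def read_fan_report():
    """Run nvidia-smi and return the parsed rows (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_fan_report(result.stdout)
```

On the affected machine, `read_fan_report()` returns one row per GPU. A row whose fan value is `None` is the card to inspect physically.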
The fan is running. I moved the card to a different slot (1 → 2), and after running for 1 day it shows ERR! again.
The issue still exists. I tried reseating the card, but that didn’t solve the problem.
Running gpu_burn does not report any issue with the cards.
Did you fix the problem? I ran into the same problem with an A6000.
psyduck
February 14, 2023, 10:10am
8
Thanks. Can you tell me whether it is a GPU fault or a CPU fault?
ckauten
February 18, 2023, 4:03pm
11
Has anyone found a solution to this problem with the A6000? Does a hardware fault imply that there is something physically wrong with the GPU that needs to be repaired?