Nvidia-smi A6000 GPU Fan ERR!

I have an Ubuntu 16.04.7 system (Linux 4.4.0-210-generic) with 8 RTX A6000 GPUs (SLI NVLink). While I was running some Python programs on the GPUs, the programs just stopped, and nvidia-smi now shows a GPU Fan error (ERR!) on one of the two GPUs. I will attach the screenshot and the nvidia-bug-report.

I do need some help. Could someone help me figure it out?

dmesg | grep NVRM

[ 735.795596] nvidia-uvm: Loaded the UVM driver, major device number 240.
[ 982.069261] NVRM: GPU at PCI:0000:e1:00: GPU-a9c5e1f6-cf9c-b370-c693-e6b57e6db987
[ 982.069359] NVRM: GPU Board Serial Number: 1322021047991
[ 982.069364] NVRM: Xid (PCI:0000:e1:00): 62, pid=3972, 0000(0000) 00000000 00000000
[ 982.093605] NVRM: Xid (PCI:0000:e1:00): 45, pid=4821, Ch 00000008
[ 983.093374] NVRM: Xid (PCI:0000:e1:00): 45, pid=4821, Ch 00000009
[ 983.093840] NVRM: Xid (PCI:0000:e1:00): 45, pid=4821, Ch 0000000a
[ 983.094263] NVRM: Xid (PCI:0000:e1:00): 45, pid=4821, Ch 0000000b
[ 983.094693] NVRM: Xid (PCI:0000:e1:00): 45, pid=4821, Ch 0000000c
[ 983.095109] NVRM: Xid (PCI:0000:e1:00): 45, pid=4821, Ch 0000000d
[ 983.095526] NVRM: Xid (PCI:0000:e1:00): 45, pid=4821, Ch 0000000e
[ 983.095942] NVRM: Xid (PCI:0000:e1:00): 45, pid=4821, Ch 0000000f
[ 983.592862] NVRM: Xid (PCI:0000:e1:00): 31, pid=4821, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x11_f4a9a000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ

Xid codes are here: XID Errors :: GPU Deployment and Management Documentation
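
If it helps with reading a log like the one above, here is a minimal Python sketch that counts the Xid events per PCI device. The function name `summarize_xids` and the `XID_HINTS` table are my own additions, not part of any NVIDIA tool. The short descriptions are paraphrased from NVIDIA's Xid documentation as I remember it, so please verify them against the linked page. The sample lines are copied from the dmesg output in the first post (only two of the eight Xid 45 lines are included to keep it short).

```python
import re
from collections import Counter

# Matches kernel log lines such as:
#   [ 982.069364] NVRM: Xid (PCI:0000:e1:00): 62, pid=3972, ...
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+)")

# Assumption: short hints paraphrased from NVIDIA's Xid documentation.
# Check the official table before acting on them.
XID_HINTS = {
    31: "GPU memory page fault",
    45: "preemptive cleanup, due to previous errors",
    62: "internal micro-controller halt",
}

def summarize_xids(dmesg_text):
    """Return a Counter keyed by (pci_device, xid_code)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = XID_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts

# Sample lines taken from the dmesg output in the first post.
SAMPLE = """\
[ 982.069364] NVRM: Xid (PCI:0000:e1:00): 62, pid=3972, 0000(0000) 00000000 00000000
[ 982.093605] NVRM: Xid (PCI:0000:e1:00): 45, pid=4821, Ch 00000008
[ 983.093374] NVRM: Xid (PCI:0000:e1:00): 45, pid=4821, Ch 00000009
[ 983.592862] NVRM: Xid (PCI:0000:e1:00): 31, pid=4821, Ch 00000008, intr 00000000.
"""

for (device, xid), n in sorted(summarize_xids(SAMPLE).items()):
    hint = XID_HINTS.get(xid, "see the Xid documentation")
    print(f"{device}  Xid {xid} x{n}  ({hint})")
```

To run it on a live system, pass the text of `dmesg` to `summarize_xids` instead of `SAMPLE`. In the log above, the Xid 62 comes first and the Xid 45 and 31 entries follow it within about a second and a half, so the first event is probably the one to look at.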

It’s possibly a hardware fault. Have you ascertained that the fan is actually operating?
A full cold restart, with power removed, may be worth a try.
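
Besides looking at the card, you can ask the driver what it reports for the fan on each GPU. Below is a rough Python sketch that wraps `nvidia-smi --query-gpu` with the `index`, `name`, `fan.speed` and `temperature.gpu` fields. The helper names `parse_fan_csv` and `query_gpus` are mine. I am assuming that a healthy fan shows up as a plain number and that anything else in that column (for example `[N/A]` or an error string) deserves a closer look; I have not confirmed exactly what string a faulty A6000 prints there.

```python
import csv
import io
import subprocess

QUERY_FIELDS = "index,name,fan.speed,temperature.gpu"

def parse_fan_csv(text):
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    gpus = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) != 4:
            continue  # skip blank or unexpected lines
        index, name, fan, temp = (field.strip() for field in record)
        gpus.append({
            "index": index,
            "name": name,
            "fan": fan,
            "temperature": temp,
            # Assumption: a non-numeric fan value means the reading failed.
            "fan_ok": fan.isdigit(),
        })
    return gpus

def query_gpus():
    """Run nvidia-smi and return the parsed fan readings."""
    output = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_fan_csv(output)
```

Calling `query_gpus()` on the affected machine and printing the entries where `fan_ok` is false should point at the same card that shows ERR! in the normal nvidia-smi table. This only tells you what the driver reports, so it does not replace physically checking that the fan spins.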


The fan is running. I moved the card from slot 1 to slot 2, and after running for one day it shows ERR! again.
The issue still exists. I also tried reseating the card, but that doesn’t solve the problem.

Running gpu_burn does not report any issue with the cards.

Did you fix the problem? I have the same problem with an A6000.

It’s a hardware fault.

Thanks. Can you tell me whether it is a GPU fault or a CPU fault?

It is the GPU (NVIDIA A6000).

thank you, bro

Has anyone found a solution to this problem with the A6000? Does a hardware fault imply that there is something physically wrong with the GPU that needs to be repaired?