How to deal with "FAULT_PDE ACCESS_TYPE_VIRT_WRITE"?

The log is as follows:
11月 15 18:51:58 dell-PowerEdge-T640 kernel: NVRM: GPU at PCI:0000:b1:00: GPU-2c48c2eb-cdc7-02bf-ac64-aa3ef8c88e96
11月 15 18:51:58 dell-PowerEdge-T640 kernel: NVRM: GPU Board Serial Number: 0324218061405
11月 15 18:51:58 dell-PowerEdge-T640 kernel: NVRM: Xid (PCI:0000:b1:00): 31, pid=5588, Ch 00000010, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x7f26_08ebc000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_WRITE
11月 15 18:51:59 dell-PowerEdge-T640 kernel: NVRM: Xid (PCI:0000:b1:00): 62, pid=5594, 1ed6(ffffe8) 00000000 00000000
11月 15 18:52:29 dell-PowerEdge-T640 kernel: NVRM: Xid (PCI:0000:b1:00): 38, pid=5594, 0000 00000000 00000000 00000000
11月 15 18:54:23 dell-PowerEdge-T640 kernel: NVRM: GPU 0000:b1:00.0: RmInitAdapter failed! (0x24:0x65:1184)
11月 15 18:54:23 dell-PowerEdge-T640 kernel: NVRM: GPU 0000:b1:00.0: rm_init_adapter failed, device minor number 2
11月 15 18:54:27 dell-PowerEdge-T640 kernel: NVRM: GPU 0000:b1:00.0: RmInitAdapter failed! (0x24:0x65:1184)
11月 15 18:54:27 dell-PowerEdge-T640 kernel: NVRM: GPU 0000:b1:00.0: rm_init_adapter failed, device minor number 2
11月 15 18:54:33 dell-PowerEdge-T640 kernel: NVRM: GPU 0000:b1:00.0: RmInitAdapter failed! (0x24:0x65:1184)
11月 15 18:54:33 dell-PowerEdge-T640 kernel: NVRM: GPU 0000:b1:00.0: rm_init_adapter failed, device minor number 2
11月 15 18:54:39 dell-PowerEdge-T640 kernel: NVRM: GPU 0000:b1:00.0: RmInitAdapter failed! (0x24:0x65:1184)
11月 15 18:54:39 dell-PowerEdge-T640 kernel: NVRM: GPU 0000:b1:00.0: rm_init_adapter failed, device minor number 2
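
For anyone triaging a similar log, the Xid codes can be pulled out with a small script. This is only a sketch: the function name `extract_xids` and the `XID_NAMES` table are my own, and the short descriptions are paraphrased from NVIDIA's Xid documentation as I remember it, so check them against the docs for your driver version.

```python
import re

# Matches the "Xid (PCI:<bus id>): <code>," part of an NVRM kernel log line.
XID_PATTERN = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+),")

# Assumed short descriptions for the codes seen in this thread
# (paraphrased from NVIDIA's Xid documentation; verify before relying on them).
XID_NAMES = {
    31: "GPU memory page fault",
    38: "Driver firmware error",
    62: "Internal micro-controller halt",
}


def extract_xids(log_text):
    """Return (pci_id, xid_code) pairs in the order they appear in the log."""
    return [(m.group(1), int(m.group(2))) for m in XID_PATTERN.finditer(log_text)]


def describe(xid_code):
    """Return the assumed description for a code, or a placeholder if unknown."""
    return XID_NAMES.get(xid_code, "unknown (see NVIDIA Xid documentation)")
```

Feeding it the output of `journalctl -k` or `dmesg` would, for the log above, yield the codes 31, 62, and 38 in that order for PCI:0000:b1:00.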


Since the GPU is crashing completely, you should look into hardware issues instead. Please try reseating the GPU in its slot, check whether it works in another system, monitor the temperatures, and check the PSU.

I checked the temperature, logging it once per second, and it stayed below 80℃ the whole time. In addition, I reinstalled the driver in several different versions, reseated the GPU in its slot, tried other slots, and even tried another server, but it still didn’t work.
Besides, I found that another exception, “Graphics SM Warp Exception on (GPC *, TPC *, SM *): Out Of Range Register”, always occurs, where each * is a number between 0 and 5.
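
The per-second temperature check described above can be scripted roughly like this. It is a minimal sketch, assuming `nvidia-smi` is on the PATH and still responds; the function names, the 80 °C threshold, and the sample count are illustrative choices of mine, not something the driver prescribes.

```python
import subprocess
import time


def parse_temperatures(output):
    """Parse nvidia-smi CSV output: one integer per line, one line per GPU."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def read_temperatures():
    """Query the current temperature of every visible GPU in degrees Celsius."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError if nvidia-smi itself fails
    )
    return parse_temperatures(result.stdout)


def watch(threshold_c=80, interval_s=1.0, samples=60):
    """Print one reading per interval and flag readings at or above the threshold."""
    for _ in range(samples):
        temps = read_temperatures()
        status = "HOT" if any(t >= threshold_c for t in temps) else "ok"
        print(time.strftime("%H:%M:%S"), temps, status)
        time.sleep(interval_s)
```

Calling `watch()` while the workload runs prints one line per second. Once the GPU has dropped out (as after the RmInitAdapter failures), `nvidia-smi` will most likely exit with an error or print a non-numeric value, and the sketch will raise rather than report a temperature.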

I think it’s faulty and you should RMA it.

Does “Out of range register” mean the register is broken?

No. “RmInitAdapter failed” means the GPU itself is likely broken.

Thanks for your detailed reply! The GPU doesn’t hit the exception all the time. So, do I really have to RMA it? If so, I will have to wait a long time to get it back.