We use Tesla P4 cards for video transcoding, and the server reported a bus error. We asked the server manufacturer to check, and they found no hardware failure on their side and pointed to the graphics card.

Error description:
Server Hardware Log
A bus fatal error was detected on a component at slot 5.
A fatal error was detected on a component at bus 133 device 0 function 0.
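
To match the log entry against a specific card, it can help to convert the reported location into the hexadecimal `domain:bus:device.function` form that `lspci -D` prints. The sketch below assumes the bus number in the hardware log is decimal and that the PCI domain is 0; neither is confirmed in this thread, and `pci_address` is a hypothetical helper name.

```python
def pci_address(bus: int, device: int, function: int, domain: int = 0) -> str:
    """Format a PCI location as domain:bus:device.function in hexadecimal."""
    return f"{domain:04x}:{bus:02x}:{device:02x}.{function:x}"


# Assumption: "bus 133 device 0 function 0" from the log is in decimal.
print(pci_address(133, 0, 0))  # prints 0000:85:00.0
```

If the assumption holds, the failing component would be the device listed at `85:00.0`.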

Three devices have similar failures. After checking the logs, the server hardware manufacturer replied that the hardware had not failed and advised us to escalate the case to NVIDIA.

After the failure, the device reappears, but the driver is lost.

Server model: Dell R740
nvidia-bug-report.log.gz (275 KB)
nvidia-new.tar.gz (259 KB)

It’s truncated. Please delete the wall of text and attach the log as a file. Hovering the mouse over an existing post of yours will reveal a paperclip icon.
Furthermore, please describe your problem properly: which error is displayed?


Please enable nvidia-persistenced to start on boot and check whether that resolves the issue.
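
One way to confirm that persistence mode is actually active on every GPU after a reboot is to query `nvidia-smi` and parse the result. This is only a minimal sketch: it assumes `nvidia-smi` is on the PATH and reports `Enabled` or `Disabled` for the `persistence_mode` field, and the function names are hypothetical.

```python
import subprocess
from typing import Dict, List

# Query one line per GPU: "<index>, <Enabled|Disabled>"
QUERY = ["nvidia-smi", "--query-gpu=index,persistence_mode", "--format=csv,noheader"]


def parse_persistence(csv_text: str) -> Dict[int, bool]:
    """Map GPU index to True when persistence mode is reported as Enabled."""
    modes = {}
    for line in csv_text.strip().splitlines():
        index, mode = [field.strip() for field in line.split(",", 1)]
        modes[int(index)] = (mode == "Enabled")
    return modes


def gpus_without_persistence() -> List[int]:
    """Run nvidia-smi and return the indices of GPUs with persistence mode off."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return [idx for idx, on in parse_persistence(result.stdout).items() if not on]
```

Calling `gpus_without_persistence()` on the server should return an empty list once the daemon is running for all cards.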

I just set it up today and also set pcie_aspm=off to disable the PCIe energy-saving mode. I need to watch it for a while. Thank you.
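
Since `pcie_aspm=off` is a kernel boot parameter, it only takes effect if it is present on the command line of the running kernel. A rough sketch for checking that, assuming a Linux host where `/proc/cmdline` is readable (the helper names are hypothetical, and quoted values containing spaces are not handled):

```python
from pathlib import Path
from typing import Dict, Optional


def kernel_params(cmdline: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def aspm_setting(cmdline_path: str = "/proc/cmdline") -> Optional[str]:
    """Return the pcie_aspm value of the running kernel, or None if it is not set."""
    return kernel_params(Path(cmdline_path).read_text()).get("pcie_aspm")
```

If `aspm_setting()` returns `"off"`, the running kernel was booted with the parameter.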

Persistence was enabled and a failure still occurred. I have exported the latest graphics card logs; please help me see what the problem is.

According to the logs, nvidia-persistenced was not running and persistence mode was not enabled. Furthermore, no NVIDIA-related error is visible; all 5 T4s are online.

I checked the reboot history and found that the server was rebooted once after the change, and nvidia-persistenced had not been set to start on boot. I have now re-enabled it and will keep the system under observation.
Could the missing driver persistence be the cause of this problem?

Running multiple GPUs headless without nvidia-persistenced running is not supported and can lead to all kinds of odd behaviour.
I can’t really tell what is happening on your system since there are no errors in the logs.

Okay, thank you. I will turn on persistence and see if the problem still occurs.