Error description:
Server Hardware Log
A bus fatal error was detected on a component at slot 5.
A fatal error was detected on a component at bus 133 device 0 function 0.
Three devices show similar failures. After reviewing the logs, the server hardware manufacturer replied that the hardware itself had not failed and asked us to escalate the case to NVIDIA.
After the failure, the device reappears, but the driver is lost.
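To match the hardware log entry against what the OS reports, it may help to convert "bus 133 device 0 function 0" into the hexadecimal bus:device.function form that tools such as lspci print. The sketch below assumes the hardware log reports the bus number in decimal, which the log itself does not confirm; `to_bdf` is a hypothetical helper name.

```python
def to_bdf(bus: int, device: int, function: int, domain: int = 0) -> str:
    """Format a PCI address as domain:bus:device.function in hex.

    Assumption: the inputs are the decimal values shown in the
    server hardware log (e.g. bus 133, device 0, function 0).
    """
    # PCI limits: 8-bit bus, 5-bit device, 3-bit function.
    if not (0 <= bus <= 0xFF and 0 <= device <= 0x1F and 0 <= function <= 0x7):
        raise ValueError("bus/device/function out of PCI range")
    return f"{domain:04x}:{bus:02x}:{device:02x}.{function:x}"


if __name__ == "__main__":
    # Values taken from the log line above; 133 decimal is 0x85.
    print(to_bdf(133, 0, 0))  # 0000:85:00.0
```

If the assumption holds, the affected device would be the one listed at 85:00.0.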
According to the logs, nvidia-persistenced was not running and persistence mode was not enabled. Apart from that, no NVIDIA-related errors are visible, and all 5 T4s are online.
I checked the reboot records and found that the machine was rebooted once after the change, and nvidia-persistenced had not been set to start at boot. I have now re-enabled it and will keep the system under observation.
Could this be a driver persistence problem?
Running multiple GPUs headless without nvidia-persistenced running is not supported and can lead to all kinds of odd behaviour.
I can’t really tell what is happening on your system since there are no errors in the logs.
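For anyone who wants to confirm after each reboot that persistence mode is actually on for every GPU, here is a minimal sketch. It assumes nvidia-smi is on the PATH and that the `persistence_mode` query field prints `Enabled` or `Disabled`; the helper names and the sample text in the comment are hypothetical, not taken from the logs in this thread.

```python
import csv
import io
import subprocess

# Query index, name and persistence mode for every GPU as headerless CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,persistence_mode",
    "--format=csv,noheader",
]


def parse_persistence(csv_text: str) -> dict:
    """Map GPU index -> True if persistence mode is reported as Enabled."""
    rows = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    # First column is the index, last column is the persistence mode.
    return {int(r[0]): r[-1].strip() == "Enabled" for r in rows if len(r) >= 3}


def gpus_without_persistence(csv_text: str) -> list:
    """Return the sorted indices of GPUs whose persistence mode is off."""
    return sorted(i for i, on in parse_persistence(csv_text).items() if not on)


if __name__ == "__main__":
    # Hypothetical output shape: "0, Tesla T4, Enabled"
    try:
        out = subprocess.run(
            QUERY, capture_output=True, text=True, check=True
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print(f"could not query nvidia-smi: {exc}")
    else:
        missing = gpus_without_persistence(out)
        if missing:
            print(f"persistence mode is off on GPU(s): {missing}")
        else:
            print("persistence mode is enabled on all GPUs")
```

Running this from a boot-time job or a monitoring check would show whether the persistence setting survived the reboot, which is the condition that was missed in this case.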