pciehp 0000:4c:01.0:pcie204: Slot(2-1): Card not present

categraf: 2024/01/20 00:19:03 systemd_linux.go:178: E! couldn't get unit NRestarts unit chronyd.service err Unknown property or interface.
categraf: 2024/01/20 00:19:03 systemd_linux.go:178: E! couldn't get unit NRestarts unit nslcd.service err Unknown property or interface.
kernel: pciehp 0000:4c:01.0:pcie204: Slot(2-1): Card not present
kernel: pciehp 0000:4c:01.0:pcie204: Slot(2-1): Link Down
kernel: NVRM: GPU at PCI:0000:4e:00: GPU-6d34b5e2-a686-f21c-83b7-3b36cb566060
kernel: NVRM: Xid (PCI:0000:4e:00): 79, pid='', name=, GPU has fallen off the bus.
kernel: NVRM: GPU 0000:4e:00.0: GPU has fallen off the bus.
kernel: NVRM: A GPU crash dump has been created. If possible, please run#012NVRM: nvidia-bug-report.sh as root to collect this data before#012NVRM: the NVIDIA kernel module is unloaded.
kernel: snd_hda_codec_hdmi hdaudioC0D0: HDMI: invalid ELD buf size -1
kernel: snd_hda_codec_hdmi hdaudioC0D0: HDMI: invalid ELD buf size -1
kernel: snd_hda_codec_hdmi hdaudioC0D0: HDMI: invalid ELD buf size -1
kernel: snd_hda_codec_hdmi hdaudioC0D0: HDMI: invalid ELD buf size -1
mmsysmon: [I] Event raised: All quorum nodes are reachable PC_QUORUM_NODES
mmsysmon: [I] Event raised: The state of component GPFS changed to HEALTHY.
mmsysmon: [I] Event raised: The state of this node changed to HEALTHY.
kernel: NVRM: Attempting to remove device 0000:4e:00.0 with non-zero usage count!
systemd: Starting titanagent check exception…
systemd: Started titanagent check exception.
kernel: pciehp 0000:4c:01.0:pcie204: Slot(2-1): Card present
kernel: pciehp 0000:4c:01.0:pcie204: Slot(2-1): Link Up

What are the possible causes of this problem?

Insufficient/broken PSU or overheating.
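
When checking a whole batch of machines, the lines that mark this failure in the log above are the pciehp slot events and the NVRM Xid line (here Xid 79, "GPU has fallen off the bus"). A small script can pull those out of syslog. This is only a sketch: the regular expressions are written against the exact line format posted in this thread and may need adjusting for other kernel or driver versions, and `scan_log` is a hypothetical helper name, not part of any NVIDIA tool.

```python
import re

# Assumed line formats, copied from the log in this thread:
#   kernel: NVRM: Xid (PCI:0000:4e:00): 79, pid=..., GPU has fallen off the bus.
#   kernel: pciehp 0000:4c:01.0:pcie204: Slot(2-1): Link Down
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")
PCIEHP_RE = re.compile(
    r"pciehp ([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\S*\s+"
    r"Slot\(([^)]+)\): (.+)$"
)


def scan_log(lines):
    """Return (kind, pci_address, detail) tuples for Xid and pciehp events."""
    events = []
    for line in lines:
        line = line.rstrip()
        m = XID_RE.search(line)
        if m:
            # detail is the numeric Xid code, e.g. 79
            events.append(("xid", m.group(1), int(m.group(2))))
            continue
        m = PCIEHP_RE.search(line)
        if m:
            # detail is the slot message, e.g. "Link Down"
            events.append(("pciehp", m.group(1), m.group(3)))
    return events


# Example input: three lines taken from the log above.
sample_lines = [
    "kernel: pciehp 0000:4c:01.0:pcie204: Slot(2-1): Link Down",
    "kernel: NVRM: Xid (PCI:0000:4e:00): 79, pid='', name=, GPU has fallen off the bus.",
    "systemd: Started titanagent check exception.",
]
events = scan_log(sample_lines)
for event in events:
    print(event)
```

On a real system the same function can be fed the lines of the syslog file or the output of `dmesg`. Counting events per PCI address across the 20 machines shows whether the problem follows one card or one slot.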

Which error code indicates this problem? Only one of the 20 devices in the stress-test batch shows it. If it is a power supply problem, how can I monitor it?
nvidia-bug-report.log.gz (1.6 MB)

Swap power cables with a working GPU and monitor temperatures to rule out overheating. If neither changes the behaviour, the GPU is likely broken and you should test it on its own in another system.
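
For the monitoring part, one option is to poll `nvidia-smi` during the stress test and log temperature and power draw per GPU. The sketch below uses the `--query-gpu` fields `index`, `pci.bus_id`, `temperature.gpu`, `power.draw` and `power.limit`. The thresholds (85 °C, 98 % of the power limit) and the helper names are assumptions chosen for illustration, not NVIDIA recommendations. Note that this only shows what the driver reports; it cannot prove a PSU fault by itself, which is why swapping cables is still the more direct test.

```python
import csv
import io
import subprocess
import time

# Field names as listed by `nvidia-smi --help-query-gpu`.
FIELDS = ["index", "pci.bus_id", "temperature.gpu", "power.draw", "power.limit"]


def parse_smi_csv(text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(rec) != len(FIELDS):
            continue  # skip blank or unexpected lines
        rows.append(dict(zip(FIELDS, (value.strip() for value in rec))))
    return rows


def flag_suspects(rows, temp_limit_c=85.0, power_margin=0.98):
    """Return rows that are hot, near the power limit, or not readable.

    Both thresholds are assumed values for illustration only.
    """
    suspects = []
    for row in rows:
        try:
            temp = float(row["temperature.gpu"])
            draw = float(row["power.draw"])
            limit = float(row["power.limit"])
        except ValueError:
            # A non-numeric value means the reading was unavailable.
            suspects.append(row)
            continue
        if temp >= temp_limit_c or draw >= power_margin * limit:
            suspects.append(row)
    return suspects


def query_gpus():
    """Run nvidia-smi once and return the parsed rows."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_smi_csv(out)


def poll(interval_s=5.0, iterations=10):
    """Print flagged GPUs every interval_s seconds, iterations times."""
    for _ in range(iterations):
        for row in flag_suspects(query_gpus()):
            print(time.strftime("%H:%M:%S"), row)
        time.sleep(interval_s)
```

Calling `poll()` while the stress test runs gives a timestamped record to compare against the time of the Xid 79 entry in syslog. If the card drops off the bus with normal temperature and power readings right before the event, overheating becomes less likely.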