Xorg Error: Failed to initialize DMA and PCIe Error in slot for RTX 5000 GPU

Over several months of program development, we hit a PCIe issue 7 times in total, on 7 different consoles. Six occurrences were on the slot for the RTX 5000 GPU, and only one was on the slot for the Dip card (a GE device used to acquire data from the X-ray detector), which happened after a software crash/shutdown during scanning (while the GPU was working on recon). So far we have not found a way to reproduce this error.

We managed to collect logs for the last 3 occurrences on the system where the error happened (Slot 4: RTX 5000), and found the following error messages in Xorg.0.log.old (not Xorg.0.log):

(EE) NVIDIA(GPU-1): Failed to initialize DMA.
(EE) NVIDIA(1): Failed to allocate push buffer
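For anyone checking other consoles for the same two messages, here is a minimal sketch that scans Xorg logs for NVIDIA (EE) lines. It assumes the logs live in /var/log and that the lines have the same shape as the two quoted above; the function names find_nvidia_errors and scan_xorg_logs are made up for this example.

```python
import re
from pathlib import Path

# Matches lines shaped like "(EE) NVIDIA(GPU-1): Failed to initialize DMA."
# (assumption: other NVIDIA error lines follow the same pattern).
NVIDIA_ERROR = re.compile(r"\(EE\)\s+NVIDIA\(([^)]*)\):?\s*(.*)")


def find_nvidia_errors(log_text):
    """Return (device, message) pairs for every NVIDIA (EE) line in log_text."""
    hits = []
    for line in log_text.splitlines():
        match = NVIDIA_ERROR.search(line)
        if match:
            hits.append((match.group(1), match.group(2).strip()))
    return hits


def scan_xorg_logs(log_dir="/var/log"):
    """Scan the current and previous Xorg logs, skipping files that are absent."""
    results = {}
    for name in ("Xorg.0.log", "Xorg.0.log.old"):
        path = Path(log_dir) / name
        if path.is_file():
            text = path.read_text(errors="replace")
            results[name] = find_nvidia_errors(text)
    return results
```

Calling scan_xorg_logs() right after a boot checks both files, which matters here because the messages only showed up in Xorg.0.log.old.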

Before the error message was shown, an abnormal reboot happened, and the message below did not appear in /var/log/messages before the reboot:

2019-08-27T07:51:22.243428+08:00 ct25 gdm-autologin]: pam_unix(gdm-autologin:session): session opened for user ctuser by (uid=0)

The following attachments are uploaded:

  1. A screenshot taken when the error was shown
  2. ct99-issue-…NVIDIA bug report from the console where the issue happened
  3. ct25-normal-…NVIDIA bug report from a console where the issue did not happen

Below are the circumstances of the 7 occurrences:

  1. The system crashed after an axial scan, then this issue appeared (error on the slot for the Dip card)
  2. During app installation: the error happened after OS installation (app not yet installed)
  3. During app installation: the error happened after app installation (app installed, but system not yet rebooted)
  4. (Details to be updated after checking with a colleague)
  5. After configuring the CT software, the system was rebooted as required. During OS shutdown, the screen went black (nothing shown) and froze until the tester forced a power-off.
  6. The system was forcefully powered off twice
  7. The issue happened while running the reboot HAST. This HAST was being run to reproduce the issue

PCIE-error.jpg
ct25-normal-nvidia-bug-report.log.gz (2.01 MB)
ct99-issue-nvidia-bug-report.log.gz (2.01 MB)

That looks more like a HW problem, and since it also happened with a different PCIe card, the mainboard might be failing.
Please try reseating the card, and maybe changing slots if possible.

Thanks for your suggestion. This issue has already happened on several different systems. However, we cannot reproduce it reliably at the moment, so it is very hard to investigate. We are trying to reproduce it by rebooting the system again and again, and it has already been reproduced twice on the same system in this testing.
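As a rough illustration of how each boot in such a reboot loop could be recorded, here is a sketch of a per-boot check. It is not our actual HAST. It assumes that a failed slot shows up either as the DMA message in Xorg.0.log.old or as the GPU missing from `lspci -n` output (the second assumption is unverified), and the record path /var/tmp/boot_check.jsonl and all function names are hypothetical.

```python
import json
import subprocess
import time
from pathlib import Path

NVIDIA_VENDOR_ID = "10de"  # PCI vendor ID assigned to NVIDIA


def count_nvidia_devices(lspci_output):
    """Count PCI functions with NVIDIA's vendor ID in `lspci -n` output."""
    count = 0
    for line in lspci_output.splitlines():
        fields = line.split()
        # `lspci -n` lines have the form "<slot> <class>: <vendor>:<device> ..."
        if len(fields) >= 3 and fields[2].lower().startswith(NVIDIA_VENDOR_ID + ":"):
            count += 1
    return count


def evaluate_boot(lspci_output, xorg_old_text, expected_devices=1):
    """Summarize one boot; 'suspect' marks a boot worth a closer look."""
    devices = count_nvidia_devices(lspci_output)
    dma_error = "Failed to initialize DMA" in xorg_old_text
    return {
        "nvidia_devices": devices,
        "dma_error": dma_error,
        "suspect": dma_error or devices < expected_devices,
    }


def run_check(record_path="/var/tmp/boot_check.jsonl"):
    """Collect live data and append one JSON line per boot (hypothetical path)."""
    lspci = subprocess.run(["lspci", "-n"], capture_output=True, text=True).stdout
    old_log = Path("/var/log/Xorg.0.log.old")
    text = old_log.read_text(errors="replace") if old_log.is_file() else ""
    entry = evaluate_boot(lspci, text)
    entry["time"] = time.strftime("%Y-%m-%dT%H:%M:%S")
    with open(record_path, "a") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry
```

Running run_check() once per boot, before triggering the next reboot, would leave one line per cycle, so the number of cycles between failures can be counted afterwards.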

In our product, each slot is assigned to a specific device, so we cannot change slots to resolve it. We need to figure out the root cause and fix it.