GeForce GTX 1050 Ti on Epyc Milan running Ubuntu 20.04 - GPU has fallen off the bus

Hi,
thanks for having a look into this.
The card crashes/detaches, resulting in a frozen or blank screen, under many different loads and drivers. To produce the attached report, I used:

  • version 510.54 of the NVIDIA drivers (Ubuntu packages)
  • GpuTest 0.7 (volplosion)

The same issue occurs with NVIDIA driver versions 390 and 470, with other stress tests, and in normal operation. The motherboard is flashed with the latest BIOS.

On the positive side, the card works without trouble in a different Intel-based machine - Ubuntu 20.04, NVIDIA drivers version 470.

Henrik

nvidia-bug-report.log (2.6 MB)

You are getting a lot of different Xid errors.
https://docs.nvidia.com/deploy/xid-errors/index.html#topic_5_2

As the card is working on another machine, I’d try a different slot and check or exchange the cabling. Maybe try another PSU.

Hi Mart,
thank you very much for your swift response. A few questions:

  • Where do I find those XiD errors in the log? I searched for “XiD” and “XID”, but these strings are only found in the base64-encoded parts.
  • What cabling? The HDMI cable?
  • I already tried a different slot without lasting success.
  • PSU. Makes some sense, but …
    Henrik

In your bug report: dmesg, journalctl, kernel log…
All the cables, especially the PSU connectors.
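For anyone else searching such a log: the driver line quoted later in this thread spells the keyword "Xid", so a case-sensitive search for "XiD" or "XID" will miss it. Below is a minimal Python sketch that pulls the Xid codes out of plain-text log content and tallies them. The function names and the regular expression are illustrative assumptions based on that one sample line, not an official parser.

```python
import re
from collections import Counter

# Pattern modelled on a kernel log line such as:
#   NVRM: Xid (PCI:0000:01:00): 62, pid=2319, ...
# (assumption: other Xid lines follow the same layout)
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+)")


def find_xids(log_text):
    """Return a list of (pci_address, xid_code) tuples found in log_text."""
    return [(m.group(1), int(m.group(2))) for m in XID_RE.finditer(log_text)]


def count_xids(log_text):
    """Tally how often each Xid code appears in log_text."""
    return Counter(code for _, code in find_xids(log_text))
```

To use it, read the uncompressed report (for example with `open("nvidia-bug-report.log", errors="replace").read()`) and pass the text to `count_xids`. The resulting codes can then be looked up in the Xid documentation linked above.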

Since the 1050 is bus-powered only, I guess power should be fine. If the GPU is really undamaged, there seems to be some general incompatibility with the Epyc platform. Two things to try:

  • upgrade the kernel using the Liquorix PPA
  • limit the PCIe speed to Gen 2 in the BIOS, if possible.

Hi Mart,
ok, I found the messages. They are from different test scenarios, though:
a) Last boot: There are three after the last reboot (61, 8, 79). Here I was running volplosion only.
b) boots before: Many 56s + others. Here I stressed the machine to the limit (hpcg on 32 CPUs + volplosion) and it is possible that there is a problem with the PSU. This is why I tried with idle CPUs (see a).

Another thing: initially the card failed on another machine (1x 24-core Epyc Rome, Ubuntu 20.04). That machine also has a motherboard from a different vendor. I can dig out the details and produce logs if needed.

Henrik

Hi generix,
a) The GPU is undamaged. I plugged it into an Intel-based machine, where it has been happily running volplosion for 2.5 hours now. The crash on the Epyc machine occurred after 1.5 hours, so I will leave it running overnight.
b) The card initially failed in a single-socket Epyc Rome machine. See my earlier post.
c) I will give your suggestions a try.
Henrik

It’s also worth monitoring temperatures on the 1050 using nvidia-smi, maybe something is blocking airflow in the epyc chassis.
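A minimal Python sketch of such temperature monitoring, assuming `nvidia-smi` is on the PATH and using its `--query-gpu` CSV output. The function names are hypothetical, and unparsable values are simply skipped:

```python
import subprocess

# nvidia-smi query: one CSV line per GPU, e.g. "0, 45"
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_temps(csv_text):
    """Parse 'index, temperature' lines into {gpu_index: temp_celsius}."""
    temps = {}
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        try:
            temps[int(fields[0])] = int(fields[1])
        except (IndexError, ValueError):
            # Skip lines that do not hold two integers (e.g. "[N/A]").
            continue
    return temps


def read_temps():
    """Run nvidia-smi once and return the parsed temperatures."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_temps(result.stdout)
```

Calling `read_temps()` every few seconds while the stress test runs, and printing the result with a timestamp, should show whether the temperature climbs before a crash.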

Hi generix,
upgrading the kernel did the trick. I am now able to run both stress tests (hpcg on 32 CPUs + volplosion) together.

Thanks a lot, this really helped me.
Henrik

Hi everybody,
the machine failed again.
I no longer need any stress test to trigger it, as the desktop does not start at all. On reboot, the screen is garbled with red lines or shows a greyish background with colored dots.

The only error I can spot in the logs is

[ 10.153540] NVRM: Xid (PCI:0000:01:00): 62, pid=2319, 00c2(9ffc) 00000000 00000000

Maybe somebody is able to help.

Henrik

nvidia-bug-report.log.gz (202.2 KB)

The more important errors are the “RmInitAdapter failed” messages you’re getting now:

NVRM: GPU 0000:41:00.0: RmInitAdapter failed! (0x25:0x65:1451)

which point to either a now-defective GPU or a general hardware-level incompatibility. Did you already try lowering the PCIe speed?
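To check which link speed was actually negotiated, Linux exposes it in sysfs. Here is a minimal Python sketch, assuming the standard `current_link_speed`, `max_link_speed`, `current_link_width` and `max_link_width` attributes are present for the device. The helper names are hypothetical, and the address `0000:41:00.0` is taken from the log line above.

```python
from pathlib import Path

# Per-lane transfer rate in GT/s mapped to the PCIe generation.
GEN_BY_RATE = {2.5: 1, 5.0: 2, 8.0: 3, 16.0: 4, 32.0: 5}

LINK_ATTRS = (
    "current_link_speed",
    "max_link_speed",
    "current_link_width",
    "max_link_width",
)


def parse_link_speed(text):
    """Turn a sysfs value like '8.0 GT/s PCIe' into a generation, or None."""
    try:
        rate = float(text.split()[0])
    except (IndexError, ValueError):
        return None
    return GEN_BY_RATE.get(rate)


def link_status(device, sysfs_root="/sys/bus/pci/devices"):
    """Read the raw link attributes of one PCI device from sysfs."""
    dev = Path(sysfs_root) / device
    return {name: (dev / name).read_text().strip() for name in LINK_ATTRS}
```

For example, `parse_link_speed(link_status("0000:41:00.0")["current_link_speed"])` would give the generation currently in use. As far as I know, the GPU may drop to a lower link speed when idle, so it is best to read the value while a load is running.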

Hi generix,
thanks for reminding me. I tried, but could not find a setting to switch to PCIe Gen 2 in the BIOS of the MZ72-HB0 (rev. 3) motherboard. There are a bunch of PCIe settings (ARI Support, ARI Enumeration, PCIe Ten Bit Tag Support) which I did not understand or try.

  • I switched to x8 and set I/O ROM to disabled, but neither made a difference.
  • I also tried different slots and a CMOS reset. This used to help from time to time, but the problems are now more consistent.

Henrik