Hi,
thanks for having a look into this.
The card crashes/detaches, resulting in a frozen or blank screen, under many different loads and drivers. To produce the attached report, I used:
version 510.54 of the NVIDIA drivers (Ubuntu packages)
GpuTest 0.7 (volplosion)
The same issue occurs with NVIDIA driver versions 390 and 470, with other stress tests, and in normal operation. The motherboard is flashed with the latest BIOS.
On the positive side, the card works without trouble in a different Intel-based machine - Ubuntu 20.04, NVIDIA drivers version 470.
Since the 1050 is bus-powered only, I guess power should be fine. If the GPU is really undamaged, there seems to be some general incompatibility with the Epyc platform. Two things to try:
Hi Mart,
ok, I found the messages. They come from different test scenarios:
a) Last boot: There are three after the last reboot (61, 8, 79). Here I was running volplosion only.
b) Earlier boots: Many 56s + others. Here I stressed the machine to the limit (hpcg on 32 CPUs + volplosion), so it is possible that there is a problem with the PSU. This is why I tried again with idle CPUs (see a).
Another thing: the card initially failed in another machine (1x 24-core Epyc Rome, Ubuntu 20.04). That machine also has a motherboard from a different vendor. I can dig out the details + produce logs if needed.
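For comparing boots as in a) and b) above, a small script can tally the message codes per log. This is only a minimal sketch: it assumes the messages are NVIDIA Xid lines in the kernel log of the form "NVRM: Xid (PCI:...): <code>, ...", and the function name count_xids is made up for illustration.

```python
import re
from collections import Counter

# Assumed line shape (not taken from the attached report):
#   "NVRM: Xid (PCI:0000:41:00): 79, pid=1234, ..."
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+)")


def count_xids(log_text):
    """Return a Counter mapping each Xid code (int) to its number of occurrences."""
    counts = Counter()
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            # group(2) is the numeric code that follows the PCI address
            counts[int(match.group(2))] += 1
    return counts
```

The input could be the saved output of "journalctl -k -b" for the current boot or "journalctl -k -b -1" for the previous one, read into a string and passed to count_xids.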
Hi generix,
a) The GPU is undamaged. I plugged it into an Intel-based machine, where it has been happily running volplosion for 2.5 hours now. The crash on the Epyc machine occurred after 1.5 hours, so I will leave it running overnight.
b) The card initially failed in a single-socket Epyc Rome machine. See my earlier post.
c) I will give your suggestions a try.
Henrik
Hi everybody,
the machine failed again.
Now I do not need any tests, as I cannot get the desktop started at all. On reboot, the screen is either garbled with red lines or shows a greyish background with colored dots.
Hi generix,
Thanks for reminding me. I tried, but could not find a setting to switch to PCIe Gen 2 in the BIOS of the MZ72-HB0 (rev. 3) MB. There are a bunch of PCIe settings (ARI Support, ARI Enumeration, PCIe Ten Bit Tag Support) which I did not understand or try.
I switched to x8 and I/O ROM=disabled, but neither made a difference.
I also tried different slots + a CMOS reset. This used to help from time to time, but the problems are now more consistent.
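To check whether a BIOS change such as the x8 setting actually reached the slot, the negotiated link can be read from the LnkCap and LnkSta lines of "lspci -vv" for the card. The following is a rough sketch under assumptions: the helper names parse_link and is_downgraded are invented here, and the regular expression assumes the usual "Speed ...GT/s, Width x..." wording of lspci.

```python
import re

# Matches "LnkCap: ... Speed 8GT/s, Width x16" and "LnkSta: Speed 2.5GT/s ..., Width x8".
# The colon directly after Cap/Sta keeps LnkCap2/LnkSta2 lines from matching.
LINK_RE = re.compile(r"Lnk(Cap|Sta):.*?Speed\s+([\d.]+)GT/s.*?Width\s+x(\d+)")


def parse_link(lspci_text):
    """Return {'Cap': (speed_gts, width), 'Sta': (speed_gts, width)} for one device."""
    result = {}
    for line in lspci_text.splitlines():
        match = LINK_RE.search(line)
        if match:
            result[match.group(1)] = (float(match.group(2)), int(match.group(3)))
    return result


def is_downgraded(link):
    """True if the current link (Sta) is slower or narrower than the capability (Cap)."""
    cap, sta = link.get("Cap"), link.get("Sta")
    if cap is None or sta is None:
        return False
    return sta[0] < cap[0] or sta[1] < cap[1]
```

The text would come from something like "sudo lspci -vv -s <address of the card>". Note that a lower speed in LnkSta is not necessarily a fault, since an idle GPU may drop the link speed to save power; the width is the more telling value here.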