nvidia fails to load with kernel modules properly installed on both Ubuntu 18.04 and Arch

Hi, I am having some problems with a Titan RTX and a GeForce RTX 2080 Ti on my workstation (I tried both Ubuntu 18.04 and Arch). Here are some details; I am also attaching the bug report and dmesg logs. Thanks in advance!

Some info about the system:

  • Motherboard: Asus WS X299 SAGE (I installed the latest BIOS: version 1001)
  • PSU: 1600 W
  • 128 GB RAM (as 8x 16 GB modules)

The two cards:
67:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
68:00.0 VGA compatible controller: NVIDIA Corporation TU102 [TITAN RTX] (rev a1)

I installed the NVIDIA drivers (430.26) and also tried older versions (back to 390). However, whenever I run nvidia-smi I always get one of the following error messages:

  • ‘no device found’, or
  • something like ‘not able to communicate with the device, verify that you have the latest drivers installed’
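Before looking at nvidia-smi itself, it can help to confirm that the kernel modules are actually loaded. The following is a minimal sketch, not a definitive check: the function name loaded_nvidia_modules is made up for illustration, and it assumes lsmod-style input with the module name in the first column.

```shell
# Hypothetical helper: list loaded kernel modules whose name starts with "nvidia".
# Reads lsmod-style text on stdin (module name in the first column).
loaded_nvidia_modules() {
    awk '$1 ~ /^nvidia/ { print $1 }'
}
```

Typical use would be `lsmod | loaded_nvidia_modules`. Empty output would suggest the module never loaded; non-empty output together with the errors above would suggest the module loaded but could not initialize the cards.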

Some notes/observations: all these problems are experienced even before trying to run xorg.

  1. If I run ‘cat /proc/driver/nvidia/gpus/*/information’, both cards are reported with an unknown model. The Titan also shows lots of question marks in the GPU UUID, while the GeForce has a proper UUID. I also tried using each card alone, but the same issues remain.

  2. When starting with a fresh installation of the OS, the video output is corrupted until I add “nomodeset” to the boot loader (see picture here https://www.dropbox.com/s/d7f93aniz385r7f/20190709_153103.jpg?dl=0)

  3. dmesg

  • Running dmesg, I see that RmInitAdapter failed with error (0x26:0xffff:1155), which I could not find anywhere online.
  • There is a “resource sanity check” message that seems to indicate that the cards are requesting more memory than the system would allow.
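The checks in items 1 and 3 above can be narrowed down to the relevant lines with two small filters. This is only a sketch under stated assumptions: the function names gpu_identity and filter_gpu_log are hypothetical, the field names "Model" and "GPU UUID" are assumed to start their lines in the driver's /proc output, and NVIDIA kernel messages are assumed to carry the "NVRM" prefix.

```shell
# Hypothetical helper: print only the Model and GPU UUID fields.
# Reads driver information text on stdin; field names are an assumption.
gpu_identity() {
    grep -E '^(Model|GPU UUID):'
}

# Hypothetical helper: keep only kernel log lines that mention the NVIDIA
# driver (NVRM prefix) or the "resource sanity check" warning. Reads stdin.
filter_gpu_log() {
    grep -E 'NVRM|resource sanity check'
}
```

For example, `cat /proc/driver/nvidia/gpus/*/information | gpu_identity` for the card details, and `dmesg | filter_gpu_log` (or `filter_gpu_log < dmesg.log` on a saved log) for the kernel messages.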

dmesg at https://www.dropbox.com/s/3efzeg8rgdjb7r9/dmesg.log?dl=0

nvidia-bug-report.log at https://www.dropbox.com/s/3efzeg8rgdjb7r9/dmesg.log?dl=0
dmesg.log (82.8 KB)
nvidia-bug-report.log (863 KB)

The “RmInitAdapter failed” message would point to defective hardware. Please check the cards one by one, reseat them, and test them in another system.

I have just tested the two cards individually on another system where other NVIDIA cards have previously been used successfully. I still see the same problem, so it does seem to be a hardware issue, as suggested.

Thanks a lot. What a chance that two newly bought cards would be defective…

Just to double check: can I be sure that the “resource sanity check” message is not a symptom of any compatibility issue between motherboard and cards?

That message is merely a sign that the GPU was reset by the driver.

ok thanks