GPU: 8x RTX A6000 Ada
System: Ubuntu 22.04
Platform: Supermicro Server AS -4125GS-TNRT2
I changed the graphics card mode from physical_display_enabled_256MB_bar1 to physical_display_disabled.
After this change the system no longer boots and shows this message:
mdadm: error opening /dev/md?*: No such file or directory
done.
Gave up waiting for root file system device. Common problems:
Boot args (cat /proc/cmdline)
Check rootdelay= (did the system wait long enough?).
Missing modules (cat /proc/modules; ls /dev)
ALERT!
UUID=78277ab8-d12a-4a30-93cf-42340fb3802f does not exist. Dropping to a shell!
After disconnecting the graphics cards, the system works fine.
I extended the system wait to 30 s, but it did not help.
I also mapped the drive the system boots from as /dev/md0p2, but this did not change anything either.
I have two Samsung PM9A3 3.84TB U.2 NVMe PCI disks tied together in RAID 1.
Do you know what could be the cause and how to fix it?
I suspect the address space is exhausted, so the NVMe drives didn’t get mapped. In that configuration, the GPUs need 384GB of mappable address space (8 x 48GB). Is Above 4G decoding/64-bit/large BARs enabled in the BIOS, and is CSM disabled? Does it boot when you remove some GPUs?
Every PCI root bus provides about 7TB of mappable address space, so this shouldn’t be an issue. The I/O space is quite small on bus 00, but the requirements shouldn’t change when the GPU mode is changed.
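As a rough sanity check on the numbers above, here is a minimal sketch of the arithmetic. It assumes each GPU exposes its full 48GB of memory through BAR1 once the display mode is disabled, and it takes the "about 7TB" per root bus figure at face value. The function and constant names are made up for illustration.

```python
GPU_COUNT = 8                     # from the original post
BAR1_GIB_PER_GPU = 48             # assumption: BAR1 covers the full 48GB of VRAM
ROOT_BUS_WINDOW_GIB = 7 * 1024    # "about 7TB" per root bus, as quoted above


def required_bar_space_gib(gpu_count, bar_gib_per_gpu):
    """Total mappable address space needed by all GPUs, in GiB."""
    return gpu_count * bar_gib_per_gpu


def fits_in_window(required_gib, window_gib):
    """True if the requirement fits into a single root bus window."""
    return required_gib <= window_gib


total = required_bar_space_gib(GPU_COUNT, BAR1_GIB_PER_GPU)
print(total, fits_in_window(total, ROOT_BUS_WINDOW_GIB))
# Note: real BAR sizes are powers of two, so the actual requirement
# reported by the firmware may be higher than this estimate.
```

Even if all eight GPUs sat behind one root bus, the estimate stays well below the window size, which is why plain exhaustion of the 64-bit space looks unlikely here.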
With a single GPU, did you try different slots, e.g. the last one instead of the first? The NVMe devices are sitting on root bus 0000:c0, though.
Kernel parameters to try:
pci=realloc
pci=realloc=off
pci=nocrs
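For a one-off test, any of these parameters can be appended to the linux line after pressing e in the GRUB menu. To make one persistent on Ubuntu, it goes into GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by sudo update-grub. Below is a minimal sketch of that edit as a text transformation. The helper name add_kernel_param and the sample file content are hypothetical, and it only handles the common double-quoted form of the variable.

```python
import re


def add_kernel_param(grub_text, param):
    """Return grub_text with param appended to GRUB_CMDLINE_LINUX_DEFAULT.

    Hypothetical helper: only handles the double-quoted form and leaves
    the text unchanged if the variable is missing.
    """
    pattern = re.compile(
        r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE
    )

    def repl(match):
        params = match.group(2).split()
        if param not in params:      # avoid adding the same parameter twice
            params.append(param)
        return match.group(1) + " ".join(params) + match.group(3)

    return pattern.sub(repl, grub_text, count=1)


# Sample content, not taken from the poster's system.
sample = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
print(add_kernel_param(sample, "pci=realloc"))
```

Try the parameters one at a time, since pci=realloc and pci=realloc=off contradict each other.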