GPU: 8x RTX A6000 Ada
System: Ubuntu 22.04
Platform: Supermicro Server AS -4125GS-TNRT2
I changed the graphics card display mode from physical_display_enabled_256MB_bar1 to physical_display_disabled.
After this change the system no longer boots and shows this message:
mdadm: error opening /dev/md?*: No such file or directory
Gave up waiting for root file system device. Common problems:
Boot args (cat /proc/cmdline)
Check rootdelay= (did the system wait long enough?).
Missing modules (cat /proc/modules; ls /dev)
UUID=78277ab8-d12a-4a30-93cf-42340fb3802f does not exist. Dropping to a shell!
After disconnecting the graphics cards, the system works fine.
I increased the wait for the root device to 30 s, but it did not help.
I also specified the drive the system boots from explicitly as /dev/md0p2.
This did not change anything either.
I have two Samsung PM9A3 3.84TB U.2 NVMe PCIe disks combined in RAID 1.
Do you know what could be the cause and how to fix it?
I suspect the address space is exhausted, so the NVMe drives didn’t get mapped. In that configuration, the GPUs need 384GB of mappable address space. Are Above 4G decoding/64-bit/large BARs enabled in the BIOS, and is CSM disabled? Does it boot when you remove some GPUs?
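As a rough check of the 384GB figure, here is a minimal sketch. It assumes that each of the 8 GPUs requests an aperture of about 48 GiB (roughly its full memory) once the display is disabled. That per-GPU size is an assumption inferred from the total quoted above, not a value read from the hardware, and the names are illustrative.

```python
# Rough estimate of the PCI address space requested by the GPUs.
# ASSUMPTION: with the display disabled, each GPU asks for an aperture
# of about 48 GiB; this is inferred from the 384GB total in the reply.
GPU_COUNT = 8
BAR_GIB_PER_GPU = 48  # assumed, not measured


def total_bar_space_gib(gpu_count: int, bar_gib: int) -> int:
    """Return the combined aperture size in GiB for identical GPUs."""
    return gpu_count * bar_gib


if __name__ == "__main__":
    total = total_bar_space_gib(GPU_COUNT, BAR_GIB_PER_GPU)
    print(f"{GPU_COUNT} GPUs x {BAR_GIB_PER_GPU} GiB = {total} GiB")
```

If the firmware or kernel cannot place that much space above 4 GB, other devices behind the same root bus can end up without resources.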
Hi, thank you for your reply.
Above 4G decoding is enabled.
Linux also fails to boot with only one graphics card installed.
The server was running with 8 graphics cards before the graphics mode change.
Please provide the dmesg output with all GPUs removed, so that the system boots.
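When looking through such a dmesg log for address-space problems, a small filter can help. The sketch below is only an illustration: the search patterns are assumptions about how the kernel typically words PCI BAR allocation failures, and the exact text varies between kernel versions. The function name is hypothetical.

```python
import re
import subprocess

# ASSUMPTION: these substrings are typical of kernel messages about
# PCI BAR allocation problems; the wording differs between kernels.
BAR_PROBLEM = re.compile(
    r"BAR \d+.*(no space|failed to assign|can't assign)", re.IGNORECASE
)


def find_bar_problems(dmesg_text: str) -> list[str]:
    """Return the log lines that look like BAR allocation failures."""
    return [line for line in dmesg_text.splitlines() if BAR_PROBLEM.search(line)]


if __name__ == "__main__":
    try:
        # Reading the kernel log may require root on some systems.
        log = subprocess.run(
            ["dmesg"], capture_output=True, text=True, check=True
        ).stdout
    except (OSError, subprocess.CalledProcessError) as exc:
        print(f"could not read dmesg: {exc}")
    else:
        for line in find_bar_problems(log):
            print(line)
```

An empty result does not prove that allocation succeeded; it only means none of the assumed patterns matched.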
Every PCI root bus provides about 7TB of mappable address space, so this shouldn’t be an issue. The I/O space is quite small on bus 00, but the requirements shouldn’t change when the GPU mode is changed.
With a single GPU, did you try different slots, e.g. the last one instead of the first one? The NVMe devices are sitting on root bus 0000:c0, though.
kernel parameters to try:
Thank you! Setting pci=realloc=off allowed Linux to boot.
Do you know why the amount of graphics memory decreased from 49140MiB to 46068MiB?
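The thread does not say how the parameter was set. On Ubuntu, one common way to make a kernel parameter persistent is to add it to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and then run `sudo update-grub`. The sketch below only shows the string edit on such a line; the helper name is hypothetical and the example line is an assumed default, not taken from this system.

```python
def add_kernel_param(grub_line: str, param: str) -> str:
    """Append `param` to a KEY="..." line unless it is already present."""
    key, sep, value = grub_line.strip().partition("=")
    if not sep:
        raise ValueError('expected a line of the form KEY="value"')
    # Drop the surrounding quotes, then split into individual parameters.
    params = value.strip().strip("\"'").split()
    if param not in params:
        params.append(param)
    joined = " ".join(params)
    return f'{key}="{joined}"'


if __name__ == "__main__":
    # ASSUMPTION: an example line; check the actual /etc/default/grub.
    line = 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"'
    print(add_kernel_param(line, "pci=realloc=off"))
```

After a reboot, `cat /proc/cmdline` shows whether the parameter is actually in effect.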
I’d guess ECC was turned on in the process.
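For reference, the size of the drop can be worked out from the two values reported above. This is plain arithmetic on those numbers; whether ECC is the actual cause is still the guess stated in the reply.

```python
# Memory totals reported in the question, in MiB.
TOTAL_BEFORE_MIB = 49140
TOTAL_AFTER_MIB = 46068


def reserved_memory(before_mib: int, after_mib: int) -> tuple[int, float]:
    """Return the missing memory in MiB and as a fraction of the original."""
    reserved = before_mib - after_mib
    return reserved, reserved / before_mib


if __name__ == "__main__":
    reserved, fraction = reserved_memory(TOTAL_BEFORE_MIB, TOTAL_AFTER_MIB)
    print(f"missing: {reserved} MiB ({reserved / 1024:.0f} GiB), {fraction:.2%}")
```

The difference is 3072 MiB, i.e. 3 GiB, or about 6.25% of the original total.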
So with ECC enabled, not all of the memory is available?
Is it possible to turn it off?
With ECC enabled, a part of the memory is used for parity.
It can be enabled or disabled using nvidia-smi.
Understood, thank you very much for your help.
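A minimal sketch of how this could look, wrapped in Python so the commands are easy to inspect before running. It uses the `-e` (ECC config) and `-i` (GPU index) options and the `--query-gpu` fields of nvidia-smi; the helper name is hypothetical. Changing the ECC mode needs root and only takes effect after a reboot or GPU reset.

```python
import shutil
import subprocess
from typing import List, Optional

# Read-only query of the current and pending ECC mode per GPU.
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=index,ecc.mode.current,ecc.mode.pending,memory.total",
    "--format=csv",
]


def ecc_command(enable: bool, gpu_index: Optional[int] = None) -> List[str]:
    """Build the nvidia-smi command that enables or disables ECC."""
    cmd = ["nvidia-smi"]
    if gpu_index is not None:
        cmd += ["-i", str(gpu_index)]  # restrict to one GPU
    cmd += ["-e", "1" if enable else "0"]
    return cmd


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found")
    else:
        print(subprocess.run(QUERY_CMD, capture_output=True, text=True).stdout)
    # Printed rather than executed: run it manually with sudo, then reboot.
    print("to disable ECC on all GPUs: sudo", " ".join(ecc_command(False)))
```

The script only prints the command that changes the setting, so nothing is modified by accident.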