RTX A6000 Ada: Ubuntu 22.04 doesn't boot after changing gpumode to physical_display_disabled

kamil.fus · January 23, 2024, 7:20pm

GPU: 8x RTX A6000 Ada
System: Ubuntu 22.04
Platform: Supermicro Server AS -4125GS-TNRT2
I changed the graphics card mode from physical_display_enabled_256MB_bar1 to physical_display_disabled.

After this change the system does not boot and shows this message:

mdadm: error opening /dev/md?*: No such file or directory
done.
Gave up waiting for root file system device. Common problems:
Boot args (cat /proc/cmdline)
Check rootdelay= (did the system wait long enough?).
Missing modules (cat /proc/modules; ls /dev)
ALERT!
UUID=78277ab8-d12a-4a30-93cf-42340fb3802f does not exist. Dropping to a shell!

After disconnecting the graphics cards, the system works fine.
I extended the boot wait time to 30 s, but it did not help.
I also specified the drive the system boots from as /dev/md0p2.
This did not change anything either.

I have two Samsung PM9A3 3.84TB U.2 NVMe PCIe disks combined in RAID 1.

Do you know what could be the cause and how to fix it?

generix · January 24, 2024, 8:13am

I suspect the address space is exhausted, so the NVMe devices didn’t get mapped. In that configuration, the GPUs need 384GB of mappable address space. Are Above 4G Decoding/64-bit/large BARs enabled in the BIOS, and is CSM disabled? Does it boot when you remove some GPUs?

kamil.fus · January 24, 2024, 8:32am

Hi, thank you for your reply.
Above 4G decoding is enabled.

Linux does not boot with only one graphics card installed either.

The server was running with 8 graphics cards before the graphics mode change.

generix · January 24, 2024, 8:39am

Please provide the dmesg output from a boot with all GPUs removed, so that the system boots.

kamil.fus · January 24, 2024, 8:43am

dmesg.txt (240.9 KB)

generix · January 24, 2024, 9:38am

Every PCI root bus provides about 7TB of mappable address space, so this shouldn’t be an issue. The I/O space is quite small on bus 00, but the requirements shouldn’t change when the GPU mode is changed.
With a single GPU, did you try different slots, e.g. the last one instead of the first one? The NVMe devices are sitting on root bus 0000:c0, though.
Kernel parameters to try:
pci=realloc
pci=realloc=off
pci=nocrs
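
For anyone applying one of these parameters on Ubuntu, here is a minimal sketch of making it persistent through GRUB. It assumes the stock /etc/default/grub layout, where GRUB_CMDLINE_LINUX_DEFAULT sits on a single double-quoted line, and GNU sed. The helper name add_kernel_param is hypothetical and only used for this example. For a one-off test you can instead press "e" at the GRUB menu and append the parameter to the line starting with "linux".

```shell
#!/usr/bin/env bash
# Sketch: append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT.
# add_kernel_param is a hypothetical helper, not part of any GRUB tooling.
# Assumption: the variable is set on one double-quoted line and its value
# contains no '|' or '&' characters (they would confuse the sed call).

add_kernel_param() {
    local param="$1" file="$2" current updated
    # Extract the current value between the double quotes.
    current=$(sed -n 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"$/\1/p' "$file")
    # Skip the edit if the exact parameter is already present as its own word.
    case " $current " in
        *" $param "*) return 0 ;;
    esac
    # Append the parameter; no leading space if the value was empty.
    updated="${current:+$current }$param"
    sed -i "s|^GRUB_CMDLINE_LINUX_DEFAULT=\".*\"\$|GRUB_CMDLINE_LINUX_DEFAULT=\"$updated\"|" "$file"
}

# Typical use on the affected server (run as root; left commented out here):
#   cp /etc/default/grub /etc/default/grub.bak
#   add_kernel_param pci=realloc=off /etc/default/grub
#   update-grub && reboot
#   cat /proc/cmdline    # after the reboot, confirm the parameter is active
```

Try the parameters one at a time, since pci=realloc and pci=realloc=off are opposites.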

kamil.fus · January 26, 2024, 9:56am

Thank you! Setting pci=realloc=off allowed Linux to boot.

kamil.fus · January 26, 2024, 9:59am

Do you know why the amount of graphics memory decreased from 49140MiB to 46068MiB?

generix · January 26, 2024, 10:10am

I’d guess ECC was turned on in the process.
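
A quick arithmetic check of the two reported figures is consistent with that guess, though it does not prove it. The sketch below assumes a nominal 48 GiB (49152 MiB) of memory per card, which is not stated in the thread; the variable names are only illustrative.

```shell
#!/usr/bin/env bash
# Sketch: compare the memory sizes reported before and after the mode change.
before_mib=49140              # reported before the change
after_mib=46068               # reported after the change
nominal_mib=$((48 * 1024))    # assumption: nominal 48 GiB per card

diff_mib=$((before_mib - after_mib))
echo "difference: ${diff_mib} MiB"                          # difference: 3072 MiB
echo "nominal / difference: $((nominal_mib / diff_mib))"    # nominal / difference: 16
```

The missing 3072 MiB is exactly 3 GiB, or one sixteenth of the assumed nominal capacity, which looks like a fixed reservation and not a random loss.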

kamil.fus · January 26, 2024, 10:31am

So with ECC enabled, not all of the memory is available?
Is it possible to turn it off?

generix · January 26, 2024, 10:43am

With ECC enabled, part of the memory is used for parity.
It can be enabled or disabled using nvidia-smi.
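
A minimal sketch of how that could look on the command line. The helper names run, show_ecc and set_ecc are hypothetical and exist only in this example; the nvidia-smi options used are the ECC query fields and the -e (--ecc-config) switch. DRY_RUN defaults to 1 here, so the commands are only printed unless you set DRY_RUN=0.

```shell
#!/usr/bin/env bash
# Sketch: check and change the ECC mode with nvidia-smi.
# run, show_ecc and set_ecc are hypothetical helper names for this example.
# With DRY_RUN=1 (the default) commands are printed instead of executed.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Report current and pending ECC mode plus total memory for every GPU.
show_ecc() {
    run nvidia-smi --query-gpu=index,name,ecc.mode.current,ecc.mode.pending,memory.total --format=csv
}

# Set the pending ECC mode on all GPUs: 0 disables, 1 enables. Needs root.
set_ecc() {
    run sudo nvidia-smi -e "$1"
}

show_ecc
set_ecc 0
# The new mode stays pending until the next reboot; run show_ecc again
# afterwards to confirm the mode and the reported memory.total.
```

To change a single card, add -i with the GPU index to the nvidia-smi call.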

kamil.fus · January 26, 2024, 4:19pm

Understood, thank you very much for your help.
