Ubuntu Server 22.04 LTS not always recognizing all L40S GPUs

Hey everyone, we’re having a very annoying issue where our server does not detect all of our GPUs. Sometimes we get lucky and all 8 GPUs are detected, but other times only 5, 6 or 7 show up in nvidia-smi.

Kernel options: pci=realloc
BIOS options: 4G decoding enabled, resizable BAR enabled
GPU: 8xL40S
Rack: ESC8000A-E12

Some things I have tried:

  • setting pci=realloc=off

I will attach the nvidia-bug-report log. Keep in mind that some of the errors shown are from a previous boot with pci=realloc=off, which didn’t help.

Thank you for your help.

I may consider downgrading to Ubuntu 20.04 or switching to CentOS as a last resort, but maybe you guys know of some other things to try!
nvidia-bug-report.log (5.2 MB)

In the logs, only 7 GPUs were initialized by the BIOS, so the driver can’t do anything. Please check for a BIOS update, contact the mainboard vendor, or check the BIOS for a setting to extend the PCIe setup time.

Thank you for your swift reply. Your posts here have been invaluable in helping me get to the point of recognising a single GPU at all!

BIOS is up to date.

Surprisingly I was told that during assembly it always worked with Windows. Any idea why this could be?

I will try contacting them and post an update.

Might also be a timing issue with Linux. You could try rescanning the PCIe bus/bridges to find GPUs not detected at boot time.
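For anyone wanting to try the rescan suggestion, here is a minimal sketch in Python. It assumes the standard Linux sysfs rescan files; the helper names and the example bridge address are made up for illustration. Writing to these files needs root, and a rescan also touches other devices on the same bus, so treat it with care.

```python
from pathlib import Path

# Default sysfs location of the PCI bus on Linux.
SYSFS_PCI = Path("/sys/bus/pci")


def rescan_path(bridge=None, sysfs_root=SYSFS_PCI):
    """Return the sysfs file that triggers a PCI rescan.

    With no argument this is the global rescan file. If `bridge` is a PCI
    address such as "0000:c0:01.1" (illustrative), the per-device rescan
    file is returned, which rescans that device's parent bus and the
    buses below it.
    """
    if bridge is None:
        return sysfs_root / "rescan"
    return sysfs_root / "devices" / bridge / "rescan"


def rescan(bridge=None, sysfs_root=SYSFS_PCI):
    """Write "1" to the rescan file; needs root on a real system."""
    target = rescan_path(bridge, sysfs_root)
    target.write_text("1")
    return target


if __name__ == "__main__":
    # Rescan everything; equivalent to: echo 1 > /sys/bus/pci/rescan
    print("triggered", rescan())
```

After the rescan, check lspci again to see whether the missing GPU has appeared.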

Maybe just make the GRUB menu visible with a 20-second timeout so there’s additional time before the kernel boots.
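A rough sketch of what that GRUB change could look like, written as a small Python helper that edits the text of /etc/default/grub. The function names are hypothetical; GRUB_TIMEOUT_STYLE and GRUB_TIMEOUT are the usual GRUB settings, but files under /etc/default/grub.d/ may override them on some installs, so this is an assumption to verify on your system.

```python
import re


def set_grub_option(text, key, value):
    """Return `text` with KEY=value set, replacing an existing line or appending one."""
    line = f"{key}={value}"
    pattern = re.compile(rf"^{re.escape(key)}=.*$", re.MULTILINE)
    if pattern.search(text):
        # A function is used as the replacement so `value` is taken literally.
        return pattern.sub(lambda _match: line, text)
    if text and not text.endswith("\n"):
        text += "\n"
    return text + line + "\n"


def show_menu_with_timeout(text, seconds=20):
    """Make the GRUB menu visible and wait `seconds` before booting."""
    text = set_grub_option(text, "GRUB_TIMEOUT_STYLE", "menu")
    return set_grub_option(text, "GRUB_TIMEOUT", str(seconds))


if __name__ == "__main__":
    sample = "GRUB_DEFAULT=0\nGRUB_TIMEOUT_STYLE=hidden\nGRUB_TIMEOUT=0\n"
    print(show_menu_with_timeout(sample), end="")
```

On Ubuntu the new settings only take effect after running update-grub as root and rebooting.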

Managed to solve it!

Downgraded to Ubuntu 20.04 LTS (workstation) instead of 22.04 Server. At first it consistently recognised only 7 GPUs, always in the same order. Adding the kernel parameter pci=realloc solved this issue.

In short: downgrade to 20.04 LTS Workstation and set pci=realloc.

This is rather odd, points to a kernel bug. Which kernel version are you running now (with 22.04 it was 5.15)?

Yeah… this is extremely confusing to me too. I’m afraid I cannot give you the exact number, as I’m not near the server and it is not connected to the internet. I downloaded the image directly: Ubuntu 20.04.6 LTS (Focal Fossa).

I have yet to get nvidia-smi working (it was getting late), but lspci was all I needed to confirm that things were working properly.

One thing I noticed is that, before adding pci=realloc, the GPUs were always discovered in a fixed order: 7 out of 8 were detected, and it was consistently the same GPU that went undetected.

On 22.04 it all seemed arbitrary: lspci listed the GPUs in a random order, and random GPUs went undetected.

With 20.04, the kernel version depends on the initial choice: the GA kernel is 5.4, the HWE kernel is 5.15 (same as 22.04 GA).

Info:

Linux version 5.15.0-67-generic (buildd@lcy02-amd64-029) (gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #74~20.04.1-Ubuntu SMP Wed Feb 22 14:52:34 UTC 2023

Okay, after installing the NVIDIA drivers and CUDA… the problem returns. GPUs are no longer detected via lspci.

Downgraded to CUDA 12.2 with the latest 535 driver (as 12.3 installs 545 by default) and this (unscientifically) seems to be more stable, more often than not detecting all the L40S GPUs…

Did you try the “grub workaround”? Please post the lspci output from a boot where all 8 GPUs are detected.

-+-[0000:e0]-+-00.0  Advanced Micro Devices, Inc. [AMD] Device [1022:14a4]
 |           +-00.2  Advanced Micro Devices, Inc. [AMD] Device [1022:149e]
 |           +-00.3  Advanced Micro Devices, Inc. [AMD] Device [1022:14a6]
 |           +-01.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-01.1-[e1]----00.0  NVIDIA Corporation Device [10de:26b9]
 |           +-02.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-03.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-04.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-05.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-05.2-[e2]----00.0  Marvell Technology Group Ltd. 88SE9230 PCIe SATA 6Gb/s Controller [1b4b:9230]
 |           +-07.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           \-07.1-[e3]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device [1022:14ac]
 |                        +-00.1  Advanced Micro Devices, Inc. [AMD] Device [1022:14dc]
 |                        \-00.4  Advanced Micro Devices, Inc. [AMD] Device [1022:14c9]
 +-[0000:c0]-+-00.0  Advanced Micro Devices, Inc. [AMD] Device [1022:14a4]
 |           +-00.2  Advanced Micro Devices, Inc. [AMD] Device [1022:149e]
 |           +-00.3  Advanced Micro Devices, Inc. [AMD] Device [1022:14a6]
 |           +-01.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-01.1-[c1]----00.0  NVIDIA Corporation Device [10de:26b9]
 |           +-02.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-03.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-03.2-[c2]----00.0  Samsung Electronics Co Ltd Device [144d:a80a]
 |           +-03.3-[c3]----00.0  Samsung Electronics Co Ltd Device [144d:a80a]
 |           +-04.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-05.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-07.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           \-07.1-[c4]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device [1022:14ac]
 |                        \-00.1  Advanced Micro Devices, Inc. [AMD] Device [1022:14dc]
 +-[0000:a0]-+-00.0  Advanced Micro Devices, Inc. [AMD] Device [1022:14a4]
 |           +-00.2  Advanced Micro Devices, Inc. [AMD] Device [1022:149e]
 |           +-00.3  Advanced Micro Devices, Inc. [AMD] Device [1022:14a6]
 |           +-01.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-01.1-[a1]----00.0  NVIDIA Corporation Device [10de:26b9]
 |           +-02.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-03.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-04.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-05.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-07.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           \-07.1-[a2]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device [1022:14ac]
 |                        \-00.1  Advanced Micro Devices, Inc. [AMD] Device [1022:14dc]
 +-[0000:80]-+-00.0  Advanced Micro Devices, Inc. [AMD] Device [1022:14a4]
 |           +-00.2  Advanced Micro Devices, Inc. [AMD] Device [1022:149e]
 |           +-00.3  Advanced Micro Devices, Inc. [AMD] Device [1022:14a6]
 |           +-01.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-01.1-[81]----00.0  NVIDIA Corporation Device [10de:26b9]
 |           +-02.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-03.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-04.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-05.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-05.1-[82-83]--+-00.0  Intel Corporation Ethernet Controller X710 for 10GBASE-T [8086:15ff]
 |           |               \-00.1  Intel Corporation Ethernet Controller X710 for 10GBASE-T [8086:15ff]
 |           +-07.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           \-07.1-[84]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device [1022:14ac]
 |                        +-00.1  Advanced Micro Devices, Inc. [AMD] Device [1022:14dc]
 |                        +-00.4  Advanced Micro Devices, Inc. [AMD] Device [1022:14c9]
 |                        \-00.5  Advanced Micro Devices, Inc. [AMD] Device [1022:14ca]
 +-[0000:60]-+-00.0  Advanced Micro Devices, Inc. [AMD] Device [1022:14a4]
 |           +-00.2  Advanced Micro Devices, Inc. [AMD] Device [1022:149e]
 |           +-00.3  Advanced Micro Devices, Inc. [AMD] Device [1022:14a6]
 |           +-01.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-01.1-[61]----00.0  NVIDIA Corporation Device [10de:26b9]
 |           +-02.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-03.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-04.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-05.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-05.2-[62-63]----00.0-[63]----00.0  ASPEED Technology, Inc. ASPEED Graphics Family [1a03:2000]
 |           +-07.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           \-07.1-[64]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device [1022:14ac]
 |                        +-00.1  Advanced Micro Devices, Inc. [AMD] Device [1022:14dc]
 |                        \-00.4  Advanced Micro Devices, Inc. [AMD] Device [1022:14c9]
 +-[0000:40]-+-00.0  Advanced Micro Devices, Inc. [AMD] Device [1022:14a4]
 |           +-00.2  Advanced Micro Devices, Inc. [AMD] Device [1022:149e]
 |           +-00.3  Advanced Micro Devices, Inc. [AMD] Device [1022:14a6]
 |           +-01.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-01.1-[41]----00.0  NVIDIA Corporation Device [10de:26b9]
 |           +-02.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-03.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-04.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-05.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-07.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           \-07.1-[42]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device [1022:14ac]
 |                        \-00.1  Advanced Micro Devices, Inc. [AMD] Device [1022:14dc]
 +-[0000:20]-+-00.0  Advanced Micro Devices, Inc. [AMD] Device [1022:14a4]
 |           +-00.2  Advanced Micro Devices, Inc. [AMD] Device [1022:149e]
 |           +-00.3  Advanced Micro Devices, Inc. [AMD] Device [1022:14a6]
 |           +-01.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-01.1-[21]----00.0  NVIDIA Corporation Device [10de:26b9]
 |           +-02.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-03.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-04.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-05.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           +-07.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
 |           \-07.1-[22]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device [1022:14ac]
 |                        \-00.1  Advanced Micro Devices, Inc. [AMD] Device [1022:14dc]
 \-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD] Device [1022:14a4]
             +-00.2  Advanced Micro Devices, Inc. [AMD] Device [1022:149e]
             +-00.3  Advanced Micro Devices, Inc. [AMD] Device [1022:14a6]
             +-01.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
             +-01.1-[01]----00.0  NVIDIA Corporation Device [10de:26b9]
             +-02.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
             +-03.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
             +-04.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
             +-05.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
             +-07.0  Advanced Micro Devices, Inc. [AMD] Device [1022:149f]
             +-07.1-[02]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device [1022:14ac]
             |            +-00.1  Advanced Micro Devices, Inc. [AMD] Device [1022:14dc]
             |            +-00.4  Advanced Micro Devices, Inc. [AMD] Device [1022:14c9]
             |            \-00.5  Advanced Micro Devices, Inc. [AMD] Device [1022:14ca]
             +-14.0  Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b]
             +-14.3  Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e]
             +-18.0  Advanced Micro Devices, Inc. [AMD] Device [1022:14ad]
             +-18.1  Advanced Micro Devices, Inc. [AMD] Device [1022:14ae]
             +-18.2  Advanced Micro Devices, Inc. [AMD] Device [1022:14af]
             +-18.3  Advanced Micro Devices, Inc. [AMD] Device [1022:14b0]
             +-18.4  Advanced Micro Devices, Inc. [AMD] Device [1022:14b1]
             +-18.5  Advanced Micro Devices, Inc. [AMD] Device [1022:14b2]
             +-18.6  Advanced Micro Devices, Inc. [AMD] Device [1022:14b3]
             +-18.7  Advanced Micro Devices, Inc. [AMD] Device [1022:14b4]
             +-19.0  Advanced Micro Devices, Inc. [AMD] Device [1022:14ad]
             +-19.1  Advanced Micro Devices, Inc. [AMD] Device [1022:14ae]
             +-19.2  Advanced Micro Devices, Inc. [AMD] Device [1022:14af]
             +-19.3  Advanced Micro Devices, Inc. [AMD] Device [1022:14b0]
             +-19.4  Advanced Micro Devices, Inc. [AMD] Device [1022:14b1]
             +-19.5  Advanced Micro Devices, Inc. [AMD] Device [1022:14b2]
             +-19.6  Advanced Micro Devices, Inc. [AMD] Device [1022:14b3]
             \-19.7  Advanced Micro Devices, Inc. [AMD] Device [1022:14b4]

This is the lspci output. I will try the “grub workaround” to see if I can consistently detect them.

So it’s the GPU at c1:00.0 that is missing, which is on the same root complex as the Samsung NVMe drives. This complicates things: rescanning the bus will likely re-enumerate the drives, which is dangerous and would likely crash the system.
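To spot which GPU is missing after a given boot without eyeballing the tree, something like the following sketch could help. It assumes tree-style lspci output with numeric IDs in the same shape as the listing above (for example from lspci -tvnn; the exact flags used in this thread are not stated), and the expected bus numbers are simply read off that listing. The function names are illustrative.

```python
import re

# Secondary bus numbers of the eight GPUs, read off the full lspci tree
# posted in this thread. Adjust for other systems.
EXPECTED_GPU_BUSES = {"01", "21", "41", "61", "81", "a1", "c1", "e1"}

# Matches a tree line such as
#   +-01.1-[c1]----00.0  NVIDIA Corporation Device [10de:26b9]
# and captures the bus number ("c1"). 10de is the NVIDIA vendor ID.
GPU_LINE = re.compile(r"-\[([0-9a-f]{2})\]-+00\.0[ \t]+.*\[10de:[0-9a-f]{4}\]")


def gpu_buses(lspci_tree):
    """Return the set of bus numbers that carry an NVIDIA device."""
    return {match.group(1) for match in GPU_LINE.finditer(lspci_tree)}


def missing_gpu_buses(lspci_tree, expected=EXPECTED_GPU_BUSES):
    """Return the expected GPU buses that do not appear in the tree."""
    return sorted(set(expected) - gpu_buses(lspci_tree))


if __name__ == "__main__":
    import sys

    # Usage: lspci -tvnn | python3 this_script.py
    print("missing GPU buses:", missing_gpu_buses(sys.stdin.read()))
```

An empty list means all eight GPUs were enumerated on that boot.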

Generally, the PCI devices on the “right” side are the ones that go undetected, I think: 01, a1, c1 and e1.

Please check if setting the kernel parameter
pci=pcie_scan_all
helps detect the missing GPU and bridge.
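Before judging whether a pci= option helps, it is worth confirming that the running kernel actually booted with it. A small sketch, assuming the usual comma-separated pci= syntax (for example pci=realloc,pcie_scan_all) and reading /proc/cmdline; the helper name is made up:

```python
from pathlib import Path


def pci_options(cmdline):
    """Return the options of every pci= parameter on a kernel command line."""
    options = []
    for token in cmdline.split():
        if token.startswith("pci="):
            # pci= takes a comma-separated list, e.g. pci=realloc,pcie_scan_all
            options.extend(opt for opt in token[len("pci="):].split(",") if opt)
    return options


if __name__ == "__main__":
    active = pci_options(Path("/proc/cmdline").read_text())
    print("active pci= options:", active)
    print("pcie_scan_all active:", "pcie_scan_all" in active)
```

If the option is not listed, the GRUB configuration was probably not regenerated or a different boot entry was used.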

I’ve been restarting the server only rarely, as it takes 1-5 reboots to get all GPUs detected, but I see no improvement with pci=pcie_scan_all.