Nvidia-driver doesn't seem to recognize the second GPU

Hello. The NVIDIA driver doesn’t recognize my second GPU.
When I run nvidia-smi, only one GPU is listed.
The problem persists even if I swap the two GPUs.

$ nvidia-smi
Wed Mar 10 03:51:04 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    Off  | 00000000:21:00.0 Off |                  N/A |
|  0%   36C    P8    11W / 370W |     71MiB / 24259MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

The second GPU is visible as 03:00.0 in lspci.

lspci -nnv | grep -i nvidia
03:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2204] (rev a1) (prog-if 00 [VGA controller])
  Subsystem: NVIDIA Corporation Device [10de:1454]
  Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
03:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:1aef] (rev a1)
  Subsystem: NVIDIA Corporation Device [10de:1454]
21:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2204] (rev a1) (prog-if 00 [VGA controller])
  Subsystem: NVIDIA Corporation Device [10de:1454]
  Kernel driver in use: nvidia
  Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
21:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:1aef] (rev a1)
  Subsystem: NVIDIA Corporation Device [10de:1454]

dmesg outputs the following:

    dmesg | grep nvidia
    [ 1.303608] nvidia: loading out-of-tree module taints kernel.
    [ 1.304679] nvidia: module license 'NVIDIA' taints kernel.
    [ 1.304959] nvidia: module license 'NVIDIA' taints kernel.
    [ 1.348075] nvidia: module verification failed: signature and/or required key missing - tainting kernel
    [ 1.370953] nvidia-nvlink: Nvlink Core is being initialized, major device number 510
    [ 1.380485] nvidia: probe of 0000:03:00.0 failed with error -1
    [ 1.382292] nvidia 0000:21:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
    [ 1.464750] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 460.32.03 Sun Dec 27 18:51:11 UTC 2020
    [ 1.470398] [drm] [nvidia-drm] [GPU ID 0x00002100] Loading driver
    [ 1.471859] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:21:00.0 on minor 0
    [ 8.115716] audit: type=1400 audit(1615347063.021:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1355 comm="apparmor_parser"
    [ 8.115720] audit: type=1400 audit(1615347063.021:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1355 comm="apparmor_parser"

Please tell me the solution.

What I tried:

  • Replacing the motherboard
  • Replacing the graphics card
  • Booting with nomodeset
  • Disabling nouveau
  • Setting GRUB_CMDLINE_LINUX_DEFAULT="pci=realloc"

I have attached a bug report.
nvidia-bug-report.log.gz (505.2 KB)

Thank you.

Environment:
Ubuntu 20.04 LTS Server (5.8.0-44-generic)
NVIDIA driver: 460.32.03
Graphics cards: RTX 3090 x 2

Please enable “Above 4G decoding” or similar in the BIOS and remove the pci=realloc parameter.

Also:

[ 0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-5.8.0-44-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro nomodeset pci=realloc
[ 0.000000] You have booted with nomodeset. This means your GPU drivers are DISABLED
[ 0.000000] Any video related functionality will be severely degraded, and you may not even be able to suspend the system properly
[ 0.000000] Unless you actually understand what nomodeset does, you should reboot without enabling it

You don’t need the nomodeset parameter, as the NVIDIA drivers are successfully installed and at least one card is working.
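To double-check after the next reboot that both parameters are really gone, something like the following might help. It is only a sketch: `find_params` is a helper name I made up, and it simply looks for exact tokens in whatever command line it is given on stdin.

```shell
#!/bin/sh
# Print each of the given parameters that appears as a whole token in a
# kernel command line read from stdin (hypothetical helper, not a system tool).
find_params() {
    # Put every token on its own line, then compare tokens one by one.
    tr -s ' \t' '\n\n' | while IFS= read -r token; do
        for wanted in "$@"; do
            if [ "$token" = "$wanted" ]; then
                printf '%s\n' "$token"
            fi
        done
    done
    return 0
}
```

On the live system it would be called as `find_params nomodeset pci=realloc < /proc/cmdline`; no output means neither parameter is active. On Ubuntu the parameters usually come from GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, and `sudo update-grub` has to be run after editing that file.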

Aside from this:

[ 197.017] (II) NVIDIA(0): ACPI: failed to connect to the ACPI event daemon; the daemon
[ 197.017] (II) NVIDIA(0): may not be running or the “AcpidSocketPath” X
[ 197.017] (II) NVIDIA(0): configuration option may not be set correctly. When the
[ 197.017] (II) NVIDIA(0): ACPI event daemon is available, the NVIDIA X driver will
[ 197.017] (II) NVIDIA(0): try to use it to receive ACPI event notifications. For
[ 197.017] (II) NVIDIA(0): details, please see the “ConnectToAcpid” and
[ 197.017] (II) NVIDIA(0): “AcpidSocketPath” X configuration options in Appendix B: X
[ 197.017] (II) NVIDIA(0): Config Options in the README.

Installing acpid might not hurt.

Thank you for your reply.

“Above 4G decoding” is already enabled.
I have attached a screenshot of the settings.

I also removed “pci=realloc”, but it didn’t help.

I have uploaded the bug report again. Thank you.

nvidia-bug-report.log.gz (451.7 KB)

The above 4G option doesn’t have any effect, resources are still only 32bit:

[    0.300759] pci_bus 0000:00: root bus resource [io  0x0000-0x03af window]
[    0.300760] pci_bus 0000:00: root bus resource [io  0x03e0-0x0cf7 window]
[    0.300761] pci_bus 0000:00: root bus resource [io  0x03b0-0x03df window]
[    0.300763] pci_bus 0000:00: root bus resource [io  0x0d00-0x3fff window]
[    0.300764] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
[    0.300766] pci_bus 0000:00: root bus resource [mem 0x000c0000-0x000dffff window]
[    0.300768] pci_bus 0000:00: root bus resource [mem 0xf8000000-0xfbfffffe window]
[    0.300770] pci_bus 0000:00: root bus resource [bus 00-1f]

Please make sure you have CSM disabled, as it overrides the 64-bit BAR option on some boards/BIOSes.
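A quick way to see from Linux how the system was booted is to look for /sys/firmware/efi, which the kernel only exposes on a UEFI boot. The sketch below wraps that check; `boot_mode` is a name I picked, and the optional argument exists only so the path can be overridden. Note that a UEFI boot does not prove that CSM is fully switched off in the firmware setup, but a missing directory is a strong hint that the system came up in legacy/CSM mode.

```shell
#!/bin/sh
# Report whether the running kernel was booted through UEFI.
# The optional first argument overrides the sysfs path (assumption: only
# useful for trying the function out on a machine without that directory).
boot_mode() {
    efi_dir="${1:-/sys/firmware/efi}"
    if [ -d "$efi_dir" ]; then
        echo "UEFI"
    else
        echo "legacy BIOS/CSM"
    fi
}
```

Calling `boot_mode` without arguments checks the real path.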

Thank you for your reply.

I tried installing “acpid”, but it didn’t help.

Also, I have set the boot mode to UEFI, which disables CSM.

Then you’ll have to contact MSI since the Above 4G memory option obviously doesn’t work.

Since you have the Re-Size BAR option (maybe disable that to test), you’re running the beta BIOS. Does downgrading the BIOS to the last stable version make 64-bit BARs work?
You can check by running
sudo dmesg | grep "root bus resource"
If it works, a mem window with an address above 32 bits shows up.
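If picking the addresses apart by eye is tedious, a filter along these lines could do it. This is a rough sketch, not a tested tool: `windows_above_4g` is a made-up name, and it assumes the usual `[mem 0xSTART-0xEND ...]` layout of these kernel messages. A window counts as 64-bit when its end address has more than 8 significant hex digits.

```shell
#!/bin/sh
# Read dmesg output on stdin and print only the "root bus resource" mem
# windows whose end address lies above 4 GiB (more than 8 significant hex
# digits). Returns non-zero when no such window is found, like grep.
windows_above_4g() {
    grep -E 'root bus resource \[mem 0x[0-9a-f]+-0x0*[1-9a-f][0-9a-f]{8,}'
}
```

For example: `sudo dmesg | windows_above_4g || echo "no mem window above 4 GiB"`.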

Yes, the board is a “TRX40 PRO WIFI” and the BIOS version is “7C60v275 (Beta)”.

I will try what you suggested.

I will reply tomorrow. Thank you.

I checked the following 4 points with

sudo dmesg | grep "root bus resource"

Only mem windows with addresses within 32 bits show up.

BIOS “7C60v275 (Beta)”

  • Re-Size BAR ON/OFF
  • Above 4G memory / Crypto Currency mining ON/OFF

BIOS “7C60v26” (stable)

  • Above 4G memory / Crypto Currency mining ON
  • Above 4G memory / Crypto Currency mining OFF

Curiously, when I turned off Above 4G memory / Crypto Currency mining on “7C60v26”, the NVIDIA driver recognized the second GPU.

I will contact MSI about this.

Thank you very much.

Just out of curiosity, please post the output of
sudo dmesg | grep "root bus resource"
in the working state.

I have already stopped the server, but I got the following output.

$ nvidia-smi
Thu Mar 11 01:08:50 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    Off  | 00000000:03:00.0 Off |                  N/A |
| 37%   31C    P0   109W / 370W |      5MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 3090    Off  | 00000000:21:00.0  On |                  N/A |
| 37%   35C    P0   114W / 370W |     51MiB / 24259MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

$ sudo dmesg | grep "root bus resource" 
[    0.303619] pci_bus 0000:00: root bus resource [io  0x0000-0x03af window]
[    0.303620] pci_bus 0000:00: root bus resource [io  0x03e0-0x0cf7 window]
[    0.303621] pci_bus 0000:00: root bus resource [io  0x03b0-0x03df window]
[    0.303623] pci_bus 0000:00: root bus resource [io  0x0d00-0x3fff window]
[    0.303624] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
[    0.303626] pci_bus 0000:00: root bus resource [mem 0x000c0000-0x000dffff window]
[    0.303628] pci_bus 0000:00: root bus resource [mem 0xc0000000-0xe1fffffe window]
[    0.303629] pci_bus 0000:00: root bus resource [bus 00-1f]
[    0.308491] pci_bus 0000:20: root bus resource [io  0x4000-0x6fff window]
[    0.308493] pci_bus 0000:20: root bus resource [mem 0x90000000-0xb1fffffe window]
[    0.308494] pci_bus 0000:20: root bus resource [bus 20-3f]
[    0.311865] pci_bus 0000:40: root bus resource [io  0x7000-0xcfff window]
[    0.311866] pci_bus 0000:40: root bus resource [mem 0xb4000000-0xb68ffffe window]
[    0.311868] pci_bus 0000:40: root bus resource [bus 40-5f]
[    0.325022] pci_bus 0000:60: root bus resource [io  0xd000-0xffff window]
[    0.325024] pci_bus 0000:60: root bus resource [mem 0xf8000000-0xf9ffffff window]
[    0.325025] pci_bus 0000:60: root bus resource [bus 60-ff]

$ dmesg | grep nvidia
[    1.297505] nvidia: loading out-of-tree module taints kernel.
[    1.298393] nvidia: module license 'NVIDIA' taints kernel.
[    1.310678] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    1.321854] nvidia-nvlink: Nvlink Core is being initialized, major device number 510
[    1.322948] nvidia 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    1.372027] nvidia 0000:21:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[    1.462432] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  460.32.03  Sun Dec 27 18:51:11 UTC 2020
[    1.465244] [drm] [nvidia-drm] [GPU ID 0x00000300] Loading driver
[    1.465952] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:03:00.0 on minor 0
[    1.466695] [drm] [nvidia-drm] [GPU ID 0x00002100] Loading driver
[    1.468840] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:21:00.0 on minor 1
[    8.125125] audit: type=1400 audit(1615424920.027:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1341 comm="apparmor_parser"
[    8.125128] audit: type=1400 audit(1615424920.027:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1341 comm="apparmor_parser"

I’ve taken a deeper look and it’s obviously a BIOS bug.
This board has 4 root buses, one for each slot. Enabling the Above 4G option enables 64-bit resources on slots 1-3 but breaks slot 0, so it can no longer run a GPU.

Addendum:
Above 4G ON: slots 1, 2, 3 usable
Above 4G OFF: slots 0, 1 usable
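To compare the two BIOS settings without reading the whole lspci dump each time, one could list just the NVIDIA GPUs that have no kernel driver bound. The following is a sketch under a few assumptions: `unbound_nvidia_gpus` is my own name, the input is `lspci -nnk` style output, and a GPU is recognized by vendor ID 10de together with a display class code (03xx). Any bound driver counts, so a card claimed by nouveau would not be listed either.

```shell
#!/bin/sh
# Read `lspci -nnk` output on stdin and print the PCI address of every NVIDIA
# display device (vendor 10de, class 03xx) without a "Kernel driver in use" line.
unbound_nvidia_gpus() {
    awk '
        # Print the pending device if it is a GPU and no driver line was seen.
        function report() { if (addr != "" && !bound) print addr }
        /^[0-9a-f]/ {                 # unindented line: a new device starts
            report()
            addr = ""; bound = 0
            if ($0 ~ /\[03[0-9a-f][0-9a-f]\]:/ && $0 ~ /\[10de:[0-9a-f]+\]/)
                addr = $1
            next
        }
        /Kernel driver in use:/ { bound = 1 }
        END { report() }
    '
}
```

Usage would be `lspci -nnk | unbound_nvidia_gpus`. With the lspci output from the first post this prints 03:00.0, the card whose probe failed.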

Thank you for looking into this in depth.
I am asking MSI about this. I will post again if there is an update.

After I contacted MSI, it turned out that the problem was caused by a bug in the BIOS.
I received a test version of the BIOS with the bug fixed and confirmed that it resolves the problem.
The fix will be included in a future BIOS release together with AMD’s updates.

Thank you very much.