NVRM: This PCI I/O region assigned to your NVIDIA device is invalid

snapeandcandy · October 5, 2022, 8:56am

Dear all,

I’m unable to get A5000 cards working on a Supermicro X12DPG-OA6 board. dmesg shows:

[   37.010271] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
               NVRM: BAR1 is 0M @ 0x0 (PCI:0000:4f:00.0)
[   37.010350] nvidia: probe of 0000:4f:00.0 failed with error -1
[   37.010449] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
               NVRM: BAR1 is 0M @ 0x0 (PCI:0000:52:00.0)
[   37.010538] nvidia: probe of 0000:52:00.0 failed with error -1
[   37.010658] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
               NVRM: BAR1 is 0M @ 0x0 (PCI:0000:56:00.0)
[   37.010736] nvidia: probe of 0000:56:00.0 failed with error -1
[   37.010853] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
               NVRM: BAR1 is 0M @ 0x0 (PCI:0000:57:00.0)
[   37.010936] nvidia: probe of 0000:57:00.0 failed with error -1
[   37.011009] NVRM: The NVIDIA probe routine failed for 4 device(s).

The fresh boot dmesg log is in the attachment.

I configured the BIOS as suggested in several topics:

Disable Secure Boot
Disable CSM
Enable “Above 4G Decoding”

I’ve tried reinstalling the OS (tried ubuntu-server 18.04 and ubuntu-desktop 18.04), but the problem still persists.

What should I do to make it work? Thank you all.

dmesg.log (204.6 KB)

generix · October 5, 2022, 9:27am

Please set the kernel parameter
pci=realloc
If that doesn’t fix it, try
pci=realloc=off
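
For readers unsure how to set a kernel parameter permanently, here is a minimal sketch. It assumes an Ubuntu-style GRUB setup where the defaults live in /etc/default/grub; the helper name add_kernel_param is hypothetical and not part of any tool. It appends the parameter to GRUB_CMDLINE_LINUX_DEFAULT, keeps a .bak copy of the file, and does nothing if the parameter is already there.

```shell
#!/bin/sh
# Hypothetical helper: append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT.
# Assumption: the GRUB defaults file is /etc/default/grub (Ubuntu-style layout).
add_kernel_param() {
    param="$1"
    file="${2:-/etc/default/grub}"

    # Skip the edit if the exact parameter is already on the line.
    if grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=.*[\" ]${param}[\" ]" "$file"; then
        return 0
    fi

    # Insert the parameter before the closing quote; a backup goes to "$file.bak".
    sed -i.bak -E "s|^(GRUB_CMDLINE_LINUX_DEFAULT=\")(.*)\"|\1\2 ${param}\"|" "$file"
}

# Usage (as root), then regenerate the GRUB config and reboot:
#   add_kernel_param pci=realloc
#   update-grub
#   cat /proc/cmdline    # after the reboot, the parameter should be listed
```

For a one-off test, the parameter can instead be added by hand to the kernel line in the GRUB boot menu editor before making it permanent.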

snapeandcandy · October 6, 2022, 2:35am

Thank you, it does work with pci=realloc.

How did you know about this parameter? I would like to learn more about this issue so I can fix it myself. Where should I begin?

Thank you again.

generix · October 6, 2022, 8:59am

It’s a very common problem with PCI resource allocation, i.e. the sizes and addresses of the memory windows a PCI device requests through its BARs (Base Address Registers). The BIOS assigns them initially, but sometimes incorrectly or in an incompatible way, so pci=realloc lets the kernel reassign the regions.
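
One way to see what was actually assigned is to read the device's sysfs resource file, where each line holds a start address, an end address, and flags in hex. The sketch below is only illustrative: the function name show_pci_regions is made up, and it assumes the first six lines correspond to BAR0 through BAR5. A region whose start and end are both zero was not assigned, which matches the "BAR1 is 0M @ 0x0" message in the log.

```shell
#!/bin/sh
# Hypothetical helper: list the first six regions of a sysfs PCI "resource" file.
# Assumption: lines 0-5 of the file map to BAR0-BAR5 of the device.
show_pci_regions() {
    idx=0
    while read -r start end flags && [ "$idx" -lt 6 ]; do
        if [ "$(($start))" -eq 0 ] && [ "$(($end))" -eq 0 ]; then
            # Start and end of zero: nothing was assigned to this region.
            echo "region $idx: unassigned"
        else
            # Size is inclusive of both the start and end address.
            echo "region $idx: $start-$end ($(($end - $start + 1)) bytes)"
        fi
        idx=$((idx + 1))
    done < "$1"
}

# Usage, with the bus address taken from the dmesg output:
#   show_pci_regions /sys/bus/pci/devices/0000:4f:00.0/resource
```

Running `lspci -vv -s 4f:00.0` shows the same regions in a more readable form, and comparing the output with and without pci=realloc makes the effect of the parameter visible.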

snapeandcandy · October 18, 2022, 2:19am

Hi again,

After installing the driver, should I keep the pci=realloc parameter?
I still have issues with nvidia-smi. Is this related to this topic?

$ nvidia-smi
Tue Oct 18 09:11:07 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A5000    On   | 00000000:4F:00.0 Off |                  Off |
|ERR!   32C    P8    16W / 230W |     13MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000    On   | 00000000:52:00.0 Off |                  Off |
|ERR!   33C    P8    16W / 230W |      3MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A5000    On   | 00000000:56:00.0 Off |                  Off |
|ERR!   33C    P8    14W / 230W |      3MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A5000    On   | 00000000:57:00.0 Off |                  Off |
| 38%   64C    P2   189W / 230W |  12365MiB / 24564MiB |     75%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     43897      C   python3                            10MiB |
|    3   N/A  N/A     29695      C   ...da3/envs/test/bin/python3    12362MiB |
+-----------------------------------------------------------------------------+

generix · October 19, 2022, 9:22am

The parameter is needed for the kernel to work properly with your mainboard, so it has to stay permanently, unless a BIOS update is released that fixes it.
The ERR! state can be triggered either by overheating memory (which I don’t think is the case, looking at the temperature of the working GPU) or by not having configured the nvidia-persistenced daemon to start on boot. Please check for that.
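
A rough sketch of how that check might look on a systemd-based install follows. The unit name nvidia-persistenced is an assumption based on the daemon name and may differ between distributions and driver packages; the function name check_unit and the SYSTEMCTL override are hypothetical conveniences, not part of any NVIDIA tool.

```shell
#!/bin/sh
# Hypothetical helper: report whether a systemd unit is enabled at boot and
# whether it is currently running.
# SYSTEMCTL may be overridden (for example in tests); it defaults to systemctl.
check_unit() {
    unit="$1"
    ctl="${SYSTEMCTL:-systemctl}"

    # Both subcommands print a one-word state; ignore their exit status.
    enabled=$("$ctl" is-enabled "$unit" 2>/dev/null) || true
    active=$("$ctl" is-active "$unit" 2>/dev/null) || true

    echo "$unit: enabled=${enabled:-unknown} active=${active:-unknown}"
}

# Usage (the unit name is an assumption):
#   check_unit nvidia-persistenced
```

The Persistence-M column in the nvidia-smi output gives a second view of the same thing, showing whether persistence mode is currently on for each GPU.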

stephen.jay · October 26, 2022, 2:59pm

Hey generix - should the nvidia-persistenced daemon NOT be configured to start at boot? Could you elaborate please?

snapeandcandy · October 26, 2022, 3:49pm

I think NVIDIA has documentation on persistence: Driver Persistence :: GPU Deployment and Management Documentation

snapeandcandy · November 2, 2022, 6:40am

Just an update: it turns out the server’s power configuration is 2+2. The ERR! only happened when one PSU was connected.

system · November 16, 2022, 6:40am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
