NVRM: This PCI I/O region assigned to your NVIDIA device is invalid

Dear all,

I’m unable to get A5000 cards to work with a Supermicro X12DPG-OA6 board. dmesg shows:

[   37.010271] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
               NVRM: BAR1 is 0M @ 0x0 (PCI:0000:4f:00.0)
[   37.010350] nvidia: probe of 0000:4f:00.0 failed with error -1
[   37.010449] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
               NVRM: BAR1 is 0M @ 0x0 (PCI:0000:52:00.0)
[   37.010538] nvidia: probe of 0000:52:00.0 failed with error -1
[   37.010658] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
               NVRM: BAR1 is 0M @ 0x0 (PCI:0000:56:00.0)
[   37.010736] nvidia: probe of 0000:56:00.0 failed with error -1
[   37.010853] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
               NVRM: BAR1 is 0M @ 0x0 (PCI:0000:57:00.0)
[   37.010936] nvidia: probe of 0000:57:00.0 failed with error -1
[   37.011009] NVRM: The NVIDIA probe routine failed for 4 device(s).

A dmesg log from a fresh boot is attached.

I configured the BIOS as suggested in several other topics:

  • Disable secure boot
  • Disable CSM
  • Enable “Above 4G Decoding”

I’ve tried reinstalling the OS (both ubuntu-server 18.04 and ubuntu-desktop 18.04), but the problem persists.

What should I do to make it work? Thank you all.

dmesg.log (204.6 KB)

Please set the kernel parameter
pci=realloc
If that doesn’t fix it, try
pci=realloc=off
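
For reference, on Ubuntu the parameter can be made permanent via GRUB (a minimal sketch, assuming the stock /etc/default/grub layout):

$ sudoedit /etc/default/grub
    # add pci=realloc inside the quotes of the existing line, e.g.:
    # GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=realloc"
$ sudo update-grub
$ sudo reboot
$ cat /proc/cmdline    # after the reboot, verify the parameter is active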

3 Likes

Thank you, it does work with pci=realloc.

How did you know about this parameter? I’d like to learn more about this issue so I can fix it myself; where should I begin?

Thank you again.

It’s a very common problem with PCI resource allocation, i.e. the memory window sizes and regions a PCI device requests (its BARs). These are initially assigned by the BIOS, but sometimes incorrectly or incompatibly, so pci=realloc lets the kernel reassign the regions itself.
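
If you want to dig into it yourself, the BAR assignments can be inspected directly (a sketch, using one of the bus addresses from your log):

$ sudo lspci -vs 4f:00.0
    # the "Memory at ..." lines are the device's BARs; in the failing state
    # BAR1 shows up as unassigned/zero-sized, matching the NVRM message
$ dmesg | grep -i -E "4f:00.0|BAR"
    # kernel messages show which BARs could not be assigned and why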

2 Likes

Hi again,

After installing the driver, should I keep the pci=realloc parameter?
I still have issues with nvidia-smi; is this related to this topic?

$ nvidia-smi
Tue Oct 18 09:11:07 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A5000    On   | 00000000:4F:00.0 Off |                  Off |
|ERR!   32C    P8    16W / 230W |     13MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000    On   | 00000000:52:00.0 Off |                  Off |
|ERR!   33C    P8    16W / 230W |      3MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A5000    On   | 00000000:56:00.0 Off |                  Off |
|ERR!   33C    P8    14W / 230W |      3MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A5000    On   | 00000000:57:00.0 Off |                  Off |
| 38%   64C    P2   189W / 230W |  12365MiB / 24564MiB |     75%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     43897      C   python3                            10MiB |
|    3   N/A  N/A     29695      C   ...da3/envs/test/bin/python3    12362MiB |
+-----------------------------------------------------------------------------+

The parameter is needed for the kernel to work properly with your mainboard, so it has to stay in place permanently, unless a BIOS update is released that fixes the allocation.
The ERR! state can be triggered either by overheating memory (unlikely, judging by the temperature of the working GPU) or by the nvidia-persistenced daemon not being configured to start on boot. Please check for that.
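
A quick way to check and enable it with systemd (a sketch, assuming the driver installed the usual nvidia-persistenced.service unit):

$ systemctl status nvidia-persistenced             # is the daemon running?
$ sudo systemctl enable --now nvidia-persistenced  # start it and enable it at boot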

1 Like

Hey generix - should the nvidia-persistenced daemon NOT be configured to start at boot? Could you elaborate please?

I think NVIDIA has documentation on driver persistence: Driver Persistence :: GPU Deployment and Management Documentation
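
For a quick check of the current state on each GPU (a sketch using standard nvidia-smi options):

$ nvidia-smi --query-gpu=index,persistence_mode --format=csv
    # the legacy alternative is to set it directly; this does not survive a reboot:
$ sudo nvidia-smi -pm 1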

Just an update: it turns out the power configuration of the server is 2+2. The ERR! only happened when a single PSU was connected.
