Adding second P40: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid

I have an HP Z8 G4 that previously had one P40 in it. I had some trouble getting that first P40 to run properly, and unfortunately I don’t remember what I had to do to make it work. If I remember correctly, it had to do with flexible BAR, but I could be wrong here. I have now bought a second P40, and for some reason the system doesn’t accept the second one.

I noticed the post “NVRM: This PCI I/O region assigned to your NVIDIA device is invalid” and tried the boot parameters pci=realloc and pci=realloc=off (see the output below), but neither fixed it. I didn’t really expect them to, because the first card is detected properly.

EDIT: I tried booting without any pci= option, with pci=realloc, and with pci=realloc=off. All three behave exactly the same for me: the first GPU is recognized, and the second is detected too, but with the invalid PCI I/O region error.
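When switching between these variants it is easy to lose track of which one the running kernel was actually booted with. A minimal sketch for checking that, assuming a Linux system where /proc/cmdline is readable; the helper name `pci_opts` is made up for illustration:

```shell
# Print every pci=... option found in a kernel command line string.
# pci_opts is a hypothetical helper name, not an existing tool.
pci_opts() {
    # Split the command line on spaces (one token per line) and keep
    # only the tokens that start with "pci=". "|| true" keeps the
    # function from failing when no such token is present.
    printf '%s\n' "$1" | tr ' ' '\n' | grep '^pci=' || true
}

# Usage: inspect the command line of the running kernel, if available.
if [ -r /proc/cmdline ]; then
    pci_opts "$(cat /proc/cmdline)"
fi
```

If this prints nothing, the kernel was booted without any pci= option.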

EDIT2: I just bought two PCIe riser ribbon cables so I can experiment with which PCIe slots the cards sit in. I think I have to move the cards to different slots, but they are too wide to fit in that configuration directly. It might work with riser cables, though.

root@zed:~# dmesg | grep -i nvidia
[   16.342189] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[   16.349533] nvidia 0000:2d:00.0: enabling device (0140 -> 0142)
[   16.467246] nvidia 0000:99:00.0: enabling device (0140 -> 0142)
[   16.470412] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[   16.475530] nvidia: probe of 0000:99:00.0 failed with error -1
[   16.478614] NVRM: The NVIDIA probe routine failed for 1 device(s).
[   16.480769] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  525.85.12  Sat Jan 28 02:10:06 UTC 2023
[   16.495815] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  525.85.12  Sat Jan 28 02:03:23 UTC 2023
[   16.504564] [drm] [nvidia-drm] [GPU ID 0x00002d00] Loading driver
[   16.505841] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:2d:00.0 on minor 1
[   19.280575] audit: type=1400 audit(1720324468.954:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1909 comm="apparmor_parser"
[   19.281669] audit: type=1400 audit(1720324468.954:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1909 comm="apparmor_parser"
[   30.414267] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[   30.449570] nvidia-uvm: Loaded the UVM driver, major device number 509.
root@zed:~# 

The other GPU still works fine, though. E.g. I can run an LLM with ollama with both P40s installed in the workstation: the first is used correctly, the second isn’t.

root@zed:~# ollama run gemma2
>>> hello
Hello! 👋 How can I help you today? 😊

>>> /bye
root@zed:~# nvidia-smi
Sun Jul  7 06:08:59 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P40           Off  | 00000000:2D:00.0 Off |                  Off |
| N/A   33C    P0    49W / 250W |   6646MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2400      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      6621      C   ...a_v11/ollama_llama_server     6640MiB |
+-----------------------------------------------------------------------------+
root@zed:~# 
 

Some more information: I also have a Radeon HD 8670 in this workstation. It is a dual-Xeon motherboard without built-in graphics, and I guess it requires a GPU with some sort of display output in order to boot. That’s why the Radeon is in there; it was the only GPU I had around to fix that problem.

root@zed:~# lspci -v | grep -i nvidia ; lshw -class video ; uname -r
2d:00.0 3D controller: NVIDIA Corporation GP102GL [Tesla P40] (rev a1)
	Subsystem: NVIDIA Corporation GP102GL [Tesla P40]
	Kernel driver in use: nvidia
	Kernel modules: nouveau, nvidia_drm, nvidia
99:00.0 3D controller: NVIDIA Corporation GP102GL [Tesla P40] (rev a1)
	Subsystem: NVIDIA Corporation GP102GL [Tesla P40]
	Kernel modules: nouveau, nvidia_drm, nvidia
  *-display                 
       description: VGA compatible controller
       product: Oland XT [Radeon HD 8670 / R5 340X OEM / R7 250/350/350X OEM]
       vendor: Advanced Micro Devices, Inc. [AMD/ATI]
       physical id: 0
       bus info: pci@0000:15:00.0
       logical name: /dev/fb0
       version: 83
       width: 64 bits
       clock: 33MHz
       capabilities: pm pciexpress msi vga_controller bus_master cap_list rom fb
       configuration: depth=32 driver=radeon latency=0 resolution=2560,1440
       resources: irq:211 memory:a0000000-afffffff memory:98000000-9803ffff ioport:4000(size=256) memory:98060000-9807ffff
  *-display
       description: 3D controller
       product: GP102GL [Tesla P40]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:2d:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress bus_master cap_list
       configuration: driver=nvidia latency=0
       resources: iomemory:e80-e7f iomemory:f00-eff irq:238 memory:b9000000-b9ffffff memory:e800000000-efffffffff memory:f000000000-f001ffffff
  *-display UNCLAIMED
       description: 3D controller
       product: GP102GL [Tesla P40]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:99:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress cap_list
       configuration: latency=0
       resources: memory:e1000000-e1ffffff
6.1.0-17-amd64
root@zed:~# cat /proc/cmdline 
BOOT_IMAGE=/boot/vmlinuz-6.1.0-17-amd64 root=UUID=26cf4699-b6db-4cb7-b6d5-f66dd93aba3c ro pci=realloc=off
root@zed:~# cat /etc/debian_version 
12.5
root@zed:~#
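
One thing that stands out in the lshw output above: 2d:00.0 lists three memory regions, while 99:00.0 lists only one (memory:e1000000-e1ffffff). That suggests, though it does not prove, that the larger regions of the second card were never assigned an address range. A minimal sketch for comparing this directly from sysfs, assuming the standard per-device `resource` file where each line is "<start> <end> <flags>" in hex and an unassigned region is all zeros; the helper name `assigned_regions` is made up for illustration:

```shell
# Count the assigned (non-zero) regions in a sysfs PCI "resource" file.
# assigned_regions is a hypothetical helper name, not an existing tool.
assigned_regions() {
    # A line counts as assigned when its start or end address is not
    # all zeros. "n + 0" makes awk print 0 instead of an empty string.
    awk '$1 !~ /^0x0+$/ || $2 !~ /^0x0+$/ { n++ } END { print n + 0 }' "$1"
}

# Bus addresses taken from the lspci output above.
for dev in 0000:2d:00.0 0000:99:00.0; do
    f="/sys/bus/pci/devices/$dev/resource"
    if [ -r "$f" ]; then
        echo "$dev: $(assigned_regions "$f") assigned region(s)"
    fi
done
```

The count can include entries other than the BARs themselves (for example an expansion ROM), so the useful signal is the difference between the two cards, not the absolute number.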