I have an HP Z8 G4 that previously had one P40 in it. I had some problems getting that first P40 to run properly, and unfortunately I don’t remember what I had to do to make it work. If I remember correctly it had to do with flexible BAR, but I could be wrong here. Now I have bought a second P40, and for some reason the system doesn’t like the second one.
I noticed this post: “NVRM: This PCI I/O region assigned to your NVIDIA device is invalid”, and tried the boot parameters pci=realloc and pci=realloc=off (see /proc/cmdline below), but neither of them fixed it. I didn’t really expect them to, because the first card is detected properly.
EDIT: I tried booting without pci=realloc, with pci=realloc, and with pci=realloc=off. All three behave exactly the same for me: the first GPU is recognized, and the second is detected but its probe fails with the I/O region error.
EDIT2: I just bought two PCIe riser ribbon cables so I can experiment with which PCIe slots the cards are in. I think I have to move them, but the cards are too wide to fit in that configuration directly. It might work with riser cables though.
root@zed:~# dmesg | grep -i nvidia
[ 16.342189] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[ 16.349533] nvidia 0000:2d:00.0: enabling device (0140 -> 0142)
[ 16.467246] nvidia 0000:99:00.0: enabling device (0140 -> 0142)
[ 16.470412] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 16.475530] nvidia: probe of 0000:99:00.0 failed with error -1
[ 16.478614] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 16.480769] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 525.85.12 Sat Jan 28 02:10:06 UTC 2023
[ 16.495815] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 525.85.12 Sat Jan 28 02:03:23 UTC 2023
[ 16.504564] [drm] [nvidia-drm] [GPU ID 0x00002d00] Loading driver
[ 16.505841] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:2d:00.0 on minor 1
[ 19.280575] audit: type=1400 audit(1720324468.954:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1909 comm="apparmor_parser"
[ 19.281669] audit: type=1400 audit(1720324468.954:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1909 comm="apparmor_parser"
[ 30.414267] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[ 30.449570] nvidia-uvm: Loaded the UVM driver, major device number 509.
root@zed:~#
The other GPU still works fine though. For example, I can run an LLM with ollama with both P40s in the workstation. The first is used correctly; the second isn’t used at all.
root@zed:~# ollama run gemma2
>>> hello
Hello! 👋 How can I help you today? 😊
>>> /bye
root@zed:~# nvidia-smi
Sun Jul 7 06:08:59 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla P40 Off | 00000000:2D:00.0 Off | Off |
| N/A 33C P0 49W / 250W | 6646MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2400 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 6621 C ...a_v11/ollama_llama_server 6640MiB |
+-----------------------------------------------------------------------------+
root@zed:~#
Some more information: I also have a Radeon HD 8670 in this workstation. It is a dual-Xeon motherboard without built-in graphics, and I guess it requires a GPU with some sort of display output in order to boot. That’s why the Radeon is in there; it was the only GPU I had around to fix that problem.
root@zed:~# lspci -v | grep -i nvidia ; lshw -class video ; uname -r
2d:00.0 3D controller: NVIDIA Corporation GP102GL [Tesla P40] (rev a1)
Subsystem: NVIDIA Corporation GP102GL [Tesla P40]
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_drm, nvidia
99:00.0 3D controller: NVIDIA Corporation GP102GL [Tesla P40] (rev a1)
Subsystem: NVIDIA Corporation GP102GL [Tesla P40]
Kernel modules: nouveau, nvidia_drm, nvidia
*-display
description: VGA compatible controller
product: Oland XT [Radeon HD 8670 / R5 340X OEM / R7 250/350/350X OEM]
vendor: Advanced Micro Devices, Inc. [AMD/ATI]
physical id: 0
bus info: pci@0000:15:00.0
logical name: /dev/fb0
version: 83
width: 64 bits
clock: 33MHz
capabilities: pm pciexpress msi vga_controller bus_master cap_list rom fb
configuration: depth=32 driver=radeon latency=0 resolution=2560,1440
resources: irq:211 memory:a0000000-afffffff memory:98000000-9803ffff ioport:4000(size=256) memory:98060000-9807ffff
*-display
description: 3D controller
product: GP102GL [Tesla P40]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:2d:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress bus_master cap_list
configuration: driver=nvidia latency=0
resources: iomemory:e80-e7f iomemory:f00-eff irq:238 memory:b9000000-b9ffffff memory:e800000000-efffffffff memory:f000000000-f001ffffff
*-display UNCLAIMED
description: 3D controller
product: GP102GL [Tesla P40]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:99:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress cap_list
configuration: latency=0
resources: memory:e1000000-e1ffffff
6.1.0-17-amd64
root@zed:~# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.1.0-17-amd64 root=UUID=26cf4699-b6db-4cb7-b6d5-f66dd93aba3c ro pci=realloc=off
root@zed:~# cat /etc/debian_version
12.5
root@zed:~#
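One thing that can be read directly off the lshw output above is how much address space each P40 was actually given. The following is a minimal sketch, assuming lshw's `memory:START-END` hexadecimal notation; the helper names `bar_sizes` and `human` are made up for illustration, and the two strings are copied from the `resources:` lines of the two cards.

```python
import re
from typing import List

# Matches lshw entries such as "memory:b9000000-b9ffffff". The \b keeps the
# "iomemory:..." entries from being picked up as memory ranges.
_MEM_RANGE = re.compile(r"\bmemory:([0-9a-f]+)-([0-9a-f]+)")


def bar_sizes(resources: str) -> List[int]:
    """Return the size in bytes of every memory range on an lshw resources line."""
    return [
        int(end, 16) - int(start, 16) + 1
        for start, end in _MEM_RANGE.findall(resources)
    ]


def human(size: float) -> str:
    """Format a byte count using binary units (hypothetical helper)."""
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024 or unit == "GiB":
            return f"{size:g} {unit}"
        size /= 1024
    return f"{size:g} GiB"  # not reached; keeps the return type explicit


# "resources:" lines copied from the lshw output for 2d:00.0 and 99:00.0.
WORKING = ("iomemory:e80-e7f iomemory:f00-eff irq:238 memory:b9000000-b9ffffff "
           "memory:e800000000-efffffffff memory:f000000000-f001ffffff")
UNCLAIMED = "memory:e1000000-e1ffffff"

if __name__ == "__main__":
    for name, line in (("2d:00.0", WORKING), ("99:00.0", UNCLAIMED)):
        print(name, [human(s) for s in bar_sizes(line)])
```

For the working card at 2d:00.0 this gives three memory regions of 16 MiB, 32 GiB and 32 MiB, while the unclaimed card at 99:00.0 only has a single 16 MiB region. That would be consistent with the two large regions above 4 GiB never having been assigned to the second card, although the lshw output alone does not prove that this is what the NVRM message is complaining about.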