First of all, I will concede that, after all my research, the Dell Precision 7820 + P40 configuration is not compatible, as stated by Dell and NVIDIA. However, I am not entirely clear as to WHY it’s not compatible. The best I’ve been able to uncover is that I’m dealing with a firmware issue where memory mapping appears to be the root cause. SBIOS/VBIOS mods seem to be my only way forward, and this is where I stop. Dell BIOS mods are complex, and VBIOS flashing is out of the question as it won’t work without the driver properly recognizing the device.
I have tracked down Dell's list of officially supported GPUs and found one that would seem to best serve my purpose: the Quadro P6000. However, before making another purchase mistake, I would like to explore what is actually going on. Both cards have 24GB of VRAM, so if I’m running into issues with the P40, why wouldn’t the P6000 have the same issues? An interesting note: I have tested and verified that 2x Tesla P4s work without issue.
Dell Precision T7820
- 2x Xeon Platinum 8160 (Skylake 6th Gen)
- 187 GiB DDR4
- Linux Mint 21.3 (Ubuntu) : Kernel 5.15/6.8 Generic
- NVIDIA Driver - 535.171.04/550.90.07
- Target Application - LLM (ollama): Lots of VRAM needed.
Here is what I have so far, to accelerate the conversation:
- UEFI Enabled
- Above 4G Decoding Enabled
- CSM/Legacy Mode Disabled
- Most Recent SBIOS (2.40.0)
- VT-d Disabled
- No Resizable BAR option in SBIOS. Not supported by Dell.
* ReBarState set to 16 (64GB)
* Can be set to any size needed or disabled
- Properly Powered (2x PCIe 8 Pin → EPS 8 Pin)
- Passive Cooling Solution
- Grub Options Tried: pci=realloc/pci=realloc=off
- lspci shows P40 is connected
- nvidia-smi shows P40 missing and not recognized by driver
- Removed 2nd CPU riser (single NUMA node)
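For anyone checking the ReBarState number in the list above, here is a minimal sketch of the arithmetic. It assumes the value is an exponent n that selects a BAR size of 2^n MiB, which matches the 16 = 64GB figure I listed. The function names are my own, and this is not taken from the tool's source.

```python
def rebar_state_to_mib(state: int) -> int:
    """Return the BAR size in MiB for a ReBarState value.

    Assumption: size = 2**state MiB, and a value of 0 means disabled.
    """
    if state <= 0:
        return 0
    return 2 ** state


def format_size(mib: int) -> str:
    """Format a MiB count as GiB when it divides evenly, else as MiB."""
    if mib >= 1024 and mib % 1024 == 0:
        return f"{mib // 1024} GiB"
    return f"{mib} MiB"


if __name__ == "__main__":
    # The value used in this post.
    print(16, "->", format_size(rebar_state_to_mib(16)))  # 16 -> 64 GiB
```

Under that assumption, 16 works out to 65536 MiB, i.e. 64 GiB.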
Target PCIe Configuration (assumed best but not 100% sure)
Max Bus: 256
Gen3 PCIe - Bios Option 1
Slot Lanes
[1] 18 - [empty]
[2] x16 90 - Tesla P40 (75w)
[3] 11 - [empty]
[4] x16 91 - Quadro P600 (75w)
[5] 11 - NVMe Drive (4Tb)
Most notable dmesg entries
Jul 22 07:12:25 bai kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:ed:00.0)
-------------------------------------------------------------------------------------------------
Jul 22 06:59:02 bai kernel: NVRM: GPU 0000:ab:00.0: RmInitAdapter failed! (0x24:0x72:1556)
Jul 22 06:59:02 bai kernel: NVRM: GPU 0000:ab:00.0: rm_init_adapter failed, device minor number 1
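To cross-check the "BAR1 is 0M @ 0x0" message against what the kernel actually assigned, the regions can be read from sysfs. This is only a sketch: it assumes the usual `/sys/bus/pci/devices/<address>/resource` layout of three hex fields per line (start, end, flags), the helper names are hypothetical, and the PCI address is the one from the log above. An all-zero line can mean either an unused region or one that was never assigned, so treat the output as a hint rather than proof.

```python
from pathlib import Path


def parse_resource_table(text: str):
    """Parse sysfs 'resource' text into (index, start, end, size_bytes) tuples.

    Assumption: each line holds three hex fields: start, end, flags.
    """
    regions = []
    for index, line in enumerate(text.splitlines()):
        fields = line.split()
        if len(fields) != 3:
            continue  # skip anything that does not look like a region line
        start, end, _flags = (int(field, 16) for field in fields)
        # An all-zero entry has no address range, so report its size as 0.
        size = 0 if (start == 0 and end == 0) else end - start + 1
        regions.append((index, start, end, size))
    return regions


def describe(regions):
    """Return one human-readable line per region, sizes in MiB."""
    return [
        f"region {index}: {size // (1024 * 1024)}M @ {hex(start)}"
        for index, start, _end, size in regions
    ]


if __name__ == "__main__":
    # Address taken from the NVRM message; change it to match lspci output.
    path = Path("/sys/bus/pci/devices/0000:ed:00.0/resource")
    if path.exists():
        for line in describe(parse_resource_table(path.read_text())):
            print(line)
    else:
        print(f"{path} not found")
```

A region that prints as 0M @ 0x0 where the card should have a large memory window would be consistent with the NVRM message.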
I have tried disabling several of the integrated devices (audio, networking, USB buses, etc.), with no relief. I have attached the diagnostics below. I have also verified that the Tesla P40 in question works by testing it in another system (my buddy’s computer).
I would love to see this explored a bit further, and hopefully give some answers to others who may attempt a similar path, so they don't make the same mistakes I did. I have collected a ton of information and can provide any additional details requested.
nvidia-bug-report.log (3.5 MB)