Only 7 of 8 GPUs are loaded: Dual GRID M40's on HP Proliant DL580 G7 Ubuntu 16.04.5 Server

Server: HP Proliant DL580 G7
4x Xeon E7-4870
512GB DDR3-1333 ECC Registered Memory
PCIe Expansion Chassis with 2x x16 (x16 electrical) slots
2x NVIDIA GRID M40 cards plugged into the native x16 PCIe 3.0 slots
2x ioDrive2 1280GB High endurance cards
1x HP SmartArray P822 2GB FBWC RAID card
1x 7TB RAID60E Logical Volume (54x SAS2 146GB 15K SFF Drives)
- 2x 400GB SAS2 SFF SSD Read/Write Cache
1x 800GB SAS2 SFF SSD RAID0 Boot Drive (2x 400GB SAS2 SSD)

Driver: Tesla for Linux 384.145 (CUDA 9.0)

Issue: The first of the four GPUs on the second GRID M40 card does not get initialized; no kernel driver is bound to it.

Issue: After POST, the server reports that it is out of PCIe resources. I have seen how to fix this on a SuperMicro server, but none of those options exist in the HP BIOS. I suspect the GPU isn’t being initialized because of PCI resource allocation.

$ sudo lshw -C display
   *-display               
       description: VGA compatible controller
       product: ES1000
       vendor: Advanced Micro Devices, Inc. [AMD/ATI]
       physical id: 3
       bus info: pci@0000:01:03.0
       version: 02
       width: 32 bits
       clock: 33MHz
       capabilities: pm vga_controller bus_master cap_list rom
       configuration: driver=radeon latency=64 mingnt=8
       resources: irq:23 memory:68000000-6fffffff ioport:2000(size=256) memory:60310000-6031ffff memory:60320000-6033ffff
  *-display UNCLAIMED
       description: VGA compatible controller
       product: GM107GL [GRID M40]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:91:00.0
       version: a2
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller cap_list
       configuration: latency=0
       resources: memory:86000000-86ffffff memory:e2000000-e3ffffff ioport:a000(size=128) memory:e4000000-e407ffff
  *-display
       description: VGA compatible controller
       product: GM107GL [GRID M40]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:92:00.0
       version: a2
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
       configuration: driver=nvidia latency=0
       resources: irq:29 memory:84000000-84ffffff memory:d0000000-dfffffff memory:c2000000-c3ffffff ioport:b000(size=128) memory:c4000000-c407ffff
  *-display
       description: VGA compatible controller
       product: GM107GL [GRID M40]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:93:00.0
       version: a2
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
       configuration: driver=nvidia latency=0
       resources: irq:26 memory:82000000-82ffffff memory:b0000000-bfffffff memory:a6000000-a7ffffff ioport:c000(size=128) memory:a8000000-a807ffff
  *-display
       description: VGA compatible controller
       product: GM107GL [GRID M40]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:94:00.0
       version: a2
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
       configuration: driver=nvidia latency=0
       resources: irq:29 memory:80000000-80ffffff memory:90000000-9fffffff memory:a2000000-a3ffffff ioport:d000(size=128) memory:a0000000-a007ffff
  *-display
       description: VGA compatible controller
       product: GM107GL [GRID M40]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:88:00.0
       version: a2
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list
       configuration: driver=nvidia latency=0
       resources: iomemory:fc00-fbff iomemory:fc00-fbff irq:33 memory:8e000000-8effffff memory:fc080000000-fc08fffffff memory:fc090000000-fc091ffffff ioport:6000(size=128)
  *-display
       description: VGA compatible controller
       product: GM107GL [GRID M40]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:89:00.0
       version: a2
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list
       configuration: driver=nvidia latency=0
       resources: iomemory:fc00-fbff iomemory:fc00-fbff irq:36 memory:8c000000-8cffffff memory:fc0a0000000-fc0afffffff memory:fc098000000-fc099ffffff ioport:7000(size=128)
  *-display
       description: VGA compatible controller
       product: GM107GL [GRID M40]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:8a:00.0
       version: a2
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list
       configuration: driver=nvidia latency=0
       resources: iomemory:fc00-fbff iomemory:fc00-fbff irq:33 memory:8a000000-8affffff memory:fc0b0000000-fc0bfffffff memory:fc0c0000000-fc0c1ffffff ioport:8000(size=128)
  *-display
       description: VGA compatible controller
       product: GM107GL [GRID M40]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:8b:00.0
       version: a2
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list
       configuration: driver=nvidia latency=0
       resources: iomemory:fc00-fbff iomemory:fc00-fbff irq:36 memory:88000000-88ffffff memory:fc0d0000000-fc0dfffffff memory:fc0c8000000-fc0c9ffffff ioport:9000(size=128)

Notice that no IRQ is assigned to GPU pci@0000:91:00.0, and that it is the only GPU marked UNCLAIMED.

$ nvidia-smi
Sat Dec  8 12:02:31 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130                Driver Version: 384.130                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|=============================================================================|
|   0  GRID M40            Off  | 00000000:88:00.0 Off |                  N/A |
| 44%   46C    P0    16W /  53W |      0MiB /  4042MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GRID M40            Off  | 00000000:89:00.0 Off |                  N/A |
| 42%   43C    P0    16W /  53W |      0MiB /  4042MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GRID M40            Off  | 00000000:8A:00.0 Off |                  N/A |
| 38%   32C    P0    16W /  53W |      0MiB /  4042MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GRID M40            Off  | 00000000:8B:00.0 Off |                  N/A |
| 38%   36C    P0    16W /  53W |      0MiB /  4042MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GRID M40            Off  | 00000000:92:00.0 Off |                  N/A |
| 40%   42C    P0    16W /  53W |      0MiB /  4042MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GRID M40            Off  | 00000000:93:00.0 Off |                  N/A |
| 38%   31C    P0    16W /  53W |      0MiB /  4042MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GRID M40            Off  | 00000000:94:00.0 Off |                  N/A |
|  0%   35C    P0    15W /  53W |      0MiB /  4042MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ nvidia-smi topo -m
	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	CPU Affinity
GPU0	 X 	PIX	PIX	PIX	PHB	PHB	PHB	20-29,60-69
GPU1	PIX	 X 	PIX	PIX	PHB	PHB	PHB	20-29,60-69
GPU2	PIX	PIX	 X 	PIX	PHB	PHB	PHB	20-29,60-69
GPU3	PIX	PIX	PIX	 X 	PHB	PHB	PHB	20-29,60-69
GPU4	PHB	PHB	PHB	PHB	 X 	PIX	PIX	20-29,60-69
GPU5	PHB	PHB	PHB	PHB	PIX	 X 	PIX	20-29,60-69
GPU6	PHB	PHB	PHB	PHB	PIX	PIX	 X 	20-29,60-69

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks
$ lspci -knn 
88:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [GRID M40] [10de:13bd] (rev a2)
	Subsystem: NVIDIA Corporation GM107GL [GRID M40] [10de:110a]
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
89:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [GRID M40] [10de:13bd] (rev a2)
	Subsystem: NVIDIA Corporation GM107GL [GRID M40] [10de:110a]
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
8a:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [GRID M40] [10de:13bd] (rev a2)
	Subsystem: NVIDIA Corporation GM107GL [GRID M40] [10de:110a]
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
8b:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [GRID M40] [10de:13bd] (rev a2)
	Subsystem: NVIDIA Corporation GM107GL [GRID M40] [10de:110a]
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
91:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [GRID M40] [10de:13bd] (rev a2)
	Subsystem: NVIDIA Corporation GM107GL [GRID M40] [10de:110a]
	Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
92:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [GRID M40] [10de:13bd] (rev a2)
	Subsystem: NVIDIA Corporation GM107GL [GRID M40] [10de:110a]
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
93:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [GRID M40] [10de:13bd] (rev a2)
	Subsystem: NVIDIA Corporation GM107GL [GRID M40] [10de:110a]
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
94:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [GRID M40] [10de:13bd] (rev a2)
	Subsystem: NVIDIA Corporation GM107GL [GRID M40] [10de:110a]
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384

Notice that GPU 91:00.0 has no kernel driver in use.
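
With this many devices it is easy to miss the one without a driver, so here is a small Python sketch that picks it out of `lspci -k` style text. The sample is trimmed from the listing above; the function name `devices_without_driver` is my own and not part of any tool.

```python
# Sample trimmed from the `lspci -knn` listing above (two devices only).
LSPCI_SAMPLE = (
    "8b:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [GRID M40] [10de:13bd] (rev a2)\n"
    "\tSubsystem: NVIDIA Corporation GM107GL [GRID M40] [10de:110a]\n"
    "\tKernel driver in use: nvidia\n"
    "\tKernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384\n"
    "91:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [GRID M40] [10de:13bd] (rev a2)\n"
    "\tSubsystem: NVIDIA Corporation GM107GL [GRID M40] [10de:110a]\n"
    "\tKernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384\n"
)


def devices_without_driver(lspci_text):
    """Return bus addresses of devices that have no 'Kernel driver in use' line."""
    unbound = []
    current = None   # bus address of the device block being read
    bound = False    # whether that block contained a driver line
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            # An unindented line starts a new device block.
            if current is not None and not bound:
                unbound.append(current)
            current = line.split()[0]
            bound = False
        elif "Kernel driver in use:" in line:
            bound = True
    # Flush the last device block.
    if current is not None and not bound:
        unbound.append(current)
    return unbound


print(devices_without_driver(LSPCI_SAMPLE))
```

On a live system the same function could be fed the real output, for example via `subprocess.run(["lspci", "-k"], capture_output=True, text=True).stdout`.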

nvidia-bug-report.log.gz:

Any suggestions/ideas would be greatly appreciated, thank you! Please let me know if there is any more information needed.

The BIOS fails to assign a proper PCI memory window for BAR1. Check whether a BIOS update is available, or contact the manufacturer of the board.
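
This shows up in the lshw output from the question: every working GPU lists a 256 MiB memory range (for example d0000000-dfffffff on 92:00.0), while 91:00.0 does not. Below is a rough Python sketch that computes the window sizes from the `resources:` lines. Treating the 256 MiB range as BAR1 is my assumption based on the listing, and the helper names are made up for illustration.

```python
import re

# `resources:` lines copied from the lshw output in the question.
LSHW_RESOURCES = {
    "91:00.0": "memory:86000000-86ffffff memory:e2000000-e3ffffff "
               "ioport:a000(size=128) memory:e4000000-e407ffff",
    "92:00.0": "irq:29 memory:84000000-84ffffff memory:d0000000-dfffffff "
               "memory:c2000000-c3ffffff ioport:b000(size=128) "
               "memory:c4000000-c407ffff",
}

# Matches memory:START-END but not iomemory:..., thanks to the word boundary.
MEM_RANGE = re.compile(r"\bmemory:([0-9a-f]+)-([0-9a-f]+)")

MIB = 1024 * 1024


def memory_window_sizes(resources):
    """Return the size in bytes of each memory range on a resources line."""
    return [int(end, 16) - int(start, 16) + 1
            for start, end in MEM_RANGE.findall(resources)]


def has_window(resources, size):
    """True if the line contains a memory range of exactly `size` bytes."""
    return size in memory_window_sizes(resources)


for bus, resources in LSHW_RESOURCES.items():
    sizes_kib = [s // 1024 for s in memory_window_sizes(resources)]
    # Assumption: the 256 MiB range seen on the working GPUs is BAR1.
    status = "present" if has_window(resources, 256 * MIB) else "MISSING"
    print(f"{bus}: windows (KiB) {sizes_kib}, 256 MiB window {status}")
```

If the 256 MiB window is missing, as it is for 91:00.0 here, the driver has no usable BAR1 mapping for that GPU, which matches the out-of-PCIe-resources message at POST.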

Hello. How did you fix the problem? I’m getting the error “NVRM: This PCI I/O region assigned to your NVIDIA device is invalid: NVRM: BAR1 is 0M @ 0x0”. Which slots are your M40 cards installed in?