Server: HP Proliant DL580 G7
4x Xeon E7-4870
512GB DDR3-1333 ECC Registered Memory
PCIe Expansion Chassis with 2x x16 (x16 electrical) slots
2x NVIDIA GRID M40 cards plugged into the native x16 PCIe 3.0 slots
2x ioDrive2 1280GB High endurance cards
1x HP SmartArray P822 2GB FBWC RAID card
1x 7TB RAID60E Logical Volume (54x SAS2 146GB 15K SFF Drives)
- 2x 400GB SAS2 SFF SSD Read/Write Cache
1x 800GB SAS2 SFF SSD RAID0 Boot Drive (2x 400GB SAS2 SSD)
Driver: Tesla for Linux 384.145 (CUDA 9.0)
Issue: The first of the four GPUs on the second GRID M40 card is never initialized, and no kernel driver is bound to it.
Issue: After POST, the server reports that it is out of PCIe resources. I have seen how to fix this on a SuperMicro server, but none of those options exist in the HP BIOS. I suspect the GPU fails to initialize because of this PCI resource allocation problem.
$ sudo lshw -C display
*-display
description: VGA compatible controller
product: ES1000
vendor: Advanced Micro Devices, Inc. [AMD/ATI]
physical id: 3
bus info: pci@0000:01:03.0
version: 02
width: 32 bits
clock: 33MHz
capabilities: pm vga_controller bus_master cap_list rom
configuration: driver=radeon latency=64 mingnt=8
resources: irq:23 memory:68000000-6fffffff ioport:2000(size=256) memory:60310000-6031ffff memory:60320000-6033ffff
*-display UNCLAIMED
description: VGA compatible controller
product: GM107GL [GRID M40]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:91:00.0
version: a2
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress vga_controller cap_list
configuration: latency=0
resources: memory:86000000-86ffffff memory:e2000000-e3ffffff ioport:a000(size=128) memory:e4000000-e407ffff
*-display
description: VGA compatible controller
product: GM107GL [GRID M40]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:92:00.0
version: a2
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
configuration: driver=nvidia latency=0
resources: irq:29 memory:84000000-84ffffff memory:d0000000-dfffffff memory:c2000000-c3ffffff ioport:b000(size=128) memory:c4000000-c407ffff
*-display
description: VGA compatible controller
product: GM107GL [GRID M40]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:93:00.0
version: a2
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
configuration: driver=nvidia latency=0
resources: irq:26 memory:82000000-82ffffff memory:b0000000-bfffffff memory:a6000000-a7ffffff ioport:c000(size=128) memory:a8000000-a807ffff
*-display
description: VGA compatible controller
product: GM107GL [GRID M40]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:94:00.0
version: a2
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
configuration: driver=nvidia latency=0
resources: irq:29 memory:80000000-80ffffff memory:90000000-9fffffff memory:a2000000-a3ffffff ioport:d000(size=128) memory:a0000000-a007ffff
*-display
description: VGA compatible controller
product: GM107GL [GRID M40]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:88:00.0
version: a2
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress vga_controller bus_master cap_list
configuration: driver=nvidia latency=0
resources: iomemory:fc00-fbff iomemory:fc00-fbff irq:33 memory:8e000000-8effffff memory:fc080000000-fc08fffffff memory:fc090000000-fc091ffffff ioport:6000(size=128)
*-display
description: VGA compatible controller
product: GM107GL [GRID M40]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:89:00.0
version: a2
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress vga_controller bus_master cap_list
configuration: driver=nvidia latency=0
resources: iomemory:fc00-fbff iomemory:fc00-fbff irq:36 memory:8c000000-8cffffff memory:fc0a0000000-fc0afffffff memory:fc098000000-fc099ffffff ioport:7000(size=128)
*-display
description: VGA compatible controller
product: GM107GL [GRID M40]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:8a:00.0
version: a2
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress vga_controller bus_master cap_list
configuration: driver=nvidia latency=0
resources: iomemory:fc00-fbff iomemory:fc00-fbff irq:33 memory:8a000000-8affffff memory:fc0b0000000-fc0bfffffff memory:fc0c0000000-fc0c1ffffff ioport:8000(size=128)
*-display
description: VGA compatible controller
product: GM107GL [GRID M40]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:8b:00.0
version: a2
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress vga_controller bus_master cap_list
configuration: driver=nvidia latency=0
resources: iomemory:fc00-fbff iomemory:fc00-fbff irq:36 memory:88000000-88ffffff memory:fc0d0000000-fc0dfffffff memory:fc0c8000000-fc0c9ffffff ioport:9000(size=128)
Notice that no IRQ is assigned to GPU pci@0000:91:00.0, and lshw marks it UNCLAIMED.
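To compare the memory regions more easily, here is a minimal Python sketch that computes the size of each `memory:` range in an lshw `resources:` line. The function name `region_sizes_mib` is my own, and the two sample strings are copied from the output above (91:00.0 and 92:00.0). It only does arithmetic on the printed ranges, so it suggests rather than proves where allocation went wrong.

```python
import re

# Match "memory:START-END" but not the "iomemory:" entries lshw also prints.
MEM_RE = re.compile(r"\bmemory:([0-9a-f]+)-([0-9a-f]+)")

def region_sizes_mib(resources):
    """Return the size in MiB of each memory range in an lshw resources line."""
    sizes = []
    for start, end in MEM_RE.findall(resources):
        # Ranges are inclusive, hence the +1.
        sizes.append((int(end, 16) - int(start, 16) + 1) / 2**20)
    return sizes

# Resource lines copied from the lshw output above.
unclaimed_91 = ("memory:86000000-86ffffff memory:e2000000-e3ffffff "
                "ioport:a000(size=128) memory:e4000000-e407ffff")
working_92 = ("irq:29 memory:84000000-84ffffff memory:d0000000-dfffffff "
              "memory:c2000000-c3ffffff ioport:b000(size=128) "
              "memory:c4000000-c407ffff")

print(region_sizes_mib(unclaimed_91))  # [16.0, 32.0, 0.5]
print(region_sizes_mib(working_92))    # [16.0, 256.0, 32.0, 0.5]
```

On this data, the working GPU has a 256 MiB region that the unclaimed GPU lacks, which would be consistent with the out-of-resources message, although I have not confirmed that this is the cause.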
$ nvidia-smi
Sat Dec 8 12:02:31 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130                Driver Version: 384.130                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID M40            Off  | 00000000:88:00.0 Off |                  N/A |
| 44%   46C    P0    16W /  53W |      0MiB /  4042MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GRID M40            Off  | 00000000:89:00.0 Off |                  N/A |
| 42%   43C    P0    16W /  53W |      0MiB /  4042MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GRID M40            Off  | 00000000:8A:00.0 Off |                  N/A |
| 38%   32C    P0    16W /  53W |      0MiB /  4042MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GRID M40            Off  | 00000000:8B:00.0 Off |                  N/A |
| 38%   36C    P0    16W /  53W |      0MiB /  4042MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GRID M40            Off  | 00000000:92:00.0 Off |                  N/A |
| 40%   42C    P0    16W /  53W |      0MiB /  4042MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GRID M40            Off  | 00000000:93:00.0 Off |                  N/A |
| 38%   31C    P0    16W /  53W |      0MiB /  4042MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GRID M40            Off  | 00000000:94:00.0 Off |                  N/A |
|  0%   35C    P0    15W /  53W |      0MiB /  4042MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
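The nvidia-smi listing shows seven GPUs where two four-GPU cards should give eight. As a rough cross-check, the following Python sketch pulls the bus IDs out of nvidia-smi text and compares them with the eight GRID M40 addresses that lshw reported. The helper name `smi_bus_ids` and the trimmed sample rows are my own. The regular expression assumes the `00000000:BB:DD.F` bus ID format shown above.

```python
import re

# An 8-hex-digit PCI domain followed by bus:device.function.
BUS_ID_RE = re.compile(
    r"\b[0-9A-Fa-f]{8}:([0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f])\b"
)

def smi_bus_ids(smi_text):
    """Return the set of bus:device.function strings found in nvidia-smi text."""
    return {bdf.lower() for bdf in BUS_ID_RE.findall(smi_text)}

# Rows trimmed from the nvidia-smi output above (only the bus IDs matter here).
smi_rows = """
| 0 GRID M40 Off | 00000000:88:00.0 Off | N/A |
| 1 GRID M40 Off | 00000000:89:00.0 Off | N/A |
| 2 GRID M40 Off | 00000000:8A:00.0 Off | N/A |
| 3 GRID M40 Off | 00000000:8B:00.0 Off | N/A |
| 4 GRID M40 Off | 00000000:92:00.0 Off | N/A |
| 5 GRID M40 Off | 00000000:93:00.0 Off | N/A |
| 6 GRID M40 Off | 00000000:94:00.0 Off | N/A |
"""

# The eight GRID M40 addresses listed by lshw earlier in this post.
expected = {"88:00.0", "89:00.0", "8a:00.0", "8b:00.0",
            "91:00.0", "92:00.0", "93:00.0", "94:00.0"}

missing = sorted(expected - smi_bus_ids(smi_rows))
print(missing)  # ['91:00.0']
```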
$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    CPU Affinity
GPU0    X       PIX     PIX     PIX     PHB     PHB     PHB     20-29,60-69
GPU1    PIX     X       PIX     PIX     PHB     PHB     PHB     20-29,60-69
GPU2    PIX     PIX     X       PIX     PHB     PHB     PHB     20-29,60-69
GPU3    PIX     PIX     PIX     X       PHB     PHB     PHB     20-29,60-69
GPU4    PHB     PHB     PHB     PHB     X       PIX     PIX     20-29,60-69
GPU5    PHB     PHB     PHB     PHB     PIX     X       PIX     20-29,60-69
GPU6    PHB     PHB     PHB     PHB     PIX     PIX     X       20-29,60-69
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks
$ lspci -knn
88:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [GRID M40] [10de:13bd] (rev a2)
Subsystem: NVIDIA Corporation GM107GL [GRID M40] [10de:110a]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
89:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [GRID M40] [10de:13bd] (rev a2)
Subsystem: NVIDIA Corporation GM107GL [GRID M40] [10de:110a]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
8a:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [GRID M40] [10de:13bd] (rev a2)
Subsystem: NVIDIA Corporation GM107GL [GRID M40] [10de:110a]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
8b:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [GRID M40] [10de:13bd] (rev a2)
Subsystem: NVIDIA Corporation GM107GL [GRID M40] [10de:110a]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
91:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [GRID M40] [10de:13bd] (rev a2)
Subsystem: NVIDIA Corporation GM107GL [GRID M40] [10de:110a]
Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
92:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [GRID M40] [10de:13bd] (rev a2)
Subsystem: NVIDIA Corporation GM107GL [GRID M40] [10de:110a]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
93:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [GRID M40] [10de:13bd] (rev a2)
Subsystem: NVIDIA Corporation GM107GL [GRID M40] [10de:110a]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
94:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [GRID M40] [10de:13bd] (rev a2)
Subsystem: NVIDIA Corporation GM107GL [GRID M40] [10de:110a]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
Notice that GPU 91:00.0 has no kernel driver in use.
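For longer listings, the same check can be scripted. This is a minimal Python sketch that scans `lspci -k` style text and reports devices with no "Kernel driver in use" line. The function name `devices_without_driver` is hypothetical. It assumes the short `BB:DD.F` address form shown above (no PCI domain prefix), and the sample text is two entries copied from the output.

```python
import re

# A device header line starts with bus:device.function, e.g. "91:00.0 ".
DEVICE_RE = re.compile(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]) ")

def devices_without_driver(lspci_text):
    """Return addresses of devices that have no 'Kernel driver in use' line."""
    drivers = {}
    current = None
    for raw in lspci_text.splitlines():
        line = raw.strip()
        match = DEVICE_RE.match(line)
        if match:
            # New device block; assume no driver until we see one.
            current = match.group(1)
            drivers[current] = None
        elif current and line.startswith("Kernel driver in use:"):
            drivers[current] = line.split(":", 1)[1].strip()
    return [addr for addr, drv in drivers.items() if drv is None]

# Two entries copied from the lspci -knn output above.
sample = """\
8b:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [GRID M40] [10de:13bd] (rev a2)
Subsystem: NVIDIA Corporation GM107GL [GRID M40] [10de:110a]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
91:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [GRID M40] [10de:13bd] (rev a2)
Subsystem: NVIDIA Corporation GM107GL [GRID M40] [10de:110a]
Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
"""

print(devices_without_driver(sample))  # ['91:00.0']
```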
nvidia-bug-report.log.gz:
Any suggestions/ideas would be greatly appreciated, thank you! Please let me know if there is any more information needed.