We’ve been running two A100 GPUs in our system with no problems. We are now trying to run one of the GPUs in an extension chassis. Both cards show up in lspci, but the one in the chassis (55:00.0) has no kernel driver bound to it, and its memory regions are reported as <ignored>:
user@syseng-2-dell-hpc:~$ lspci -v -s 25:00.0
25:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
	Subsystem: NVIDIA Corporation GA100 [A100 PCIe 80GB]
	Physical Slot: 2-1
	Flags: bus master, fast devsel, latency 0, IRQ 774, NUMA node 0, IOMMU group 29
	Memory at 98000000 (32-bit, non-prefetchable) [size=16M]
	Memory at 1e000000000 (64-bit, prefetchable) [size=128G]
	Memory at 1d000000000 (64-bit, prefetchable) [size=32M]
	Capabilities: <access denied>
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
user@syseng-2-dell-hpc:~$ lspci -v -s 55:00.0
55:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
	Subsystem: NVIDIA Corporation GA100 [A100 PCIe 80GB]
	Flags: bus master, fast devsel, latency 0, IRQ 51, NUMA node 0, IOMMU group 70
	Memory at <ignored> (32-bit, non-prefetchable)
	Memory at <ignored> (64-bit, prefetchable)
	Memory at <ignored> (64-bit, prefetchable)
	Capabilities: <access denied>
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
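The difference between the two cards can also be checked programmatically. What follows is a minimal Python sketch, assuming the `lspci -v` text layout shown above; the names `summarize_lspci` and `SAMPLE` are hypothetical and used only for illustration. It reports, per device, the bound kernel driver (if any) and how many memory regions are listed as `<ignored>`, which as far as I understand means the kernel did not accept or assign that BAR.

```python
import re

# Device header, e.g. "55:00.0 3D controller: ..." (PCI domain prefix is optional).
HEADER = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+\S")

# Abbreviated sample taken from the lspci output in this post.
SAMPLE = """\
25:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
Memory at 98000000 (32-bit, non-prefetchable) [size=16M]
Memory at 1e000000000 (64-bit, prefetchable) [size=128G]
Memory at 1d000000000 (64-bit, prefetchable) [size=32M]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
55:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
Memory at <ignored> (32-bit, non-prefetchable)
Memory at <ignored> (64-bit, prefetchable)
Memory at <ignored> (64-bit, prefetchable)
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
"""


def summarize_lspci(text):
    """Map each PCI address to its bound driver and count of <ignored> BARs."""
    devices = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()  # tolerate both indented and flattened output
        match = HEADER.match(line)
        if match:
            current = devices.setdefault(
                match.group(1), {"driver": None, "ignored_bars": 0}
            )
        elif current is None:
            continue  # skip shell prompts or text before the first device
        elif line.startswith("Kernel driver in use:"):
            current["driver"] = line.split(":", 1)[1].strip()
        elif line.startswith("Memory at <ignored>"):
            current["ignored_bars"] += 1
    return devices


if __name__ == "__main__":
    for address, info in summarize_lspci(SAMPLE).items():
        if info["driver"] is None:
            print(f"{address}: no driver bound, "
                  f"{info['ignored_bars']} ignored memory region(s)")
```

In practice the text would come from running `lspci -v` (for example via `subprocess.run`) rather than from a hard-coded string; the sample is only there to keep the sketch self-contained.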
nvidia-smi lists only the GPU at 25:00.0:
user@syseng-2-dell-hpc:~$ nvidia-smi
Tue Jan 31 16:19:58 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.78.01    Driver Version: 525.78.01    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000000:25:00.0 Off |                    0 |
| N/A   36C    P0    46W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
The dmesg output shows the nvidia driver failing to probe 0000:55:00.0 with error -1:
user@syseng-2-dell-hpc:~$ sudo dmesg | grep nvidia
[ 28.512680] nvidia: loading out-of-tree module taints kernel.
[ 28.512696] nvidia: module license 'NVIDIA' taints kernel.
[ 28.533721] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 28.554023] nvidia-nvlink: Nvlink Core is being initialized, major device number 510
[ 28.925163] nvidia: probe of 0000:55:00.0 failed with error -1
[ 28.957357] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 525.78.01 Mon Dec 26 05:38:56 UTC 2022
[ 28.964860] [drm] [nvidia-drm] [GPU ID 0x00002500] Loading driver
[ 30.615877] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:25:00.0 on minor 1
[ 33.110011] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[ 33.122440] nvidia-uvm: Loaded the UVM driver, major device number 507.
[ 33.408423] audit: type=1400 audit(1675181119.445:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1898 comm="apparmor_parser"
[ 33.408428] audit: type=1400 audit(1675181119.445:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1898 comm="apparmor_parser"
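The probe failure is easy to miss among the other driver messages. As a rough illustration, here is a minimal Python sketch that pulls such lines out of dmesg text. It assumes the message format seen in the log above ("<driver>: probe of <address> failed with error <n>"); the names `find_probe_failures` and `SAMPLE` are hypothetical.

```python
import re

# Matches e.g. "[ 28.925163] nvidia: probe of 0000:55:00.0 failed with error -1"
PROBE_FAIL = re.compile(
    r"^\[\s*(?P<time>\d+\.\d+)\]\s+(?P<driver>[\w-]+): "
    r"probe of (?P<addr>\S+) failed with error (?P<err>-?\d+)"
)

# A few lines copied from the dmesg output in this post.
SAMPLE = """\
[ 28.554023] nvidia-nvlink: Nvlink Core is being initialized, major device number 510
[ 28.925163] nvidia: probe of 0000:55:00.0 failed with error -1
[ 28.964860] [drm] [nvidia-drm] [GPU ID 0x00002500] Loading driver
"""


def find_probe_failures(dmesg_text):
    """Return (timestamp, driver, pci_address, error_code) for each failed probe."""
    failures = []
    for line in dmesg_text.splitlines():
        match = PROBE_FAIL.match(line.strip())
        if match:
            failures.append((
                float(match.group("time")),
                match.group("driver"),
                match.group("addr"),
                int(match.group("err")),
            ))
    return failures


if __name__ == "__main__":
    for timestamp, driver, address, error in find_probe_failures(SAMPLE):
        print(f"{timestamp:.6f}s: {driver} could not bind {address} (error {error})")
```

Running it against the full, unfiltered `sudo dmesg` output instead of the `grep nvidia` subset may be more useful, since PCI resource allocation messages for 0000:55:00.0 would not contain the word "nvidia".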
Any recommendations on what to look at would be greatly appreciated. Please let me know if any other information would be helpful.
Tony