I'm hoping someone can help me with this issue:
I have an HPE Apollo XL675D with 8x NVIDIA A100-SXM4-40GB GPUs in it, running Ubuntu Server 24.04 LTS.
Every time I run: nvidia-smi -q | grep -A2 -i fabric
both State and Status come back as N/A.
Is this normal behaviour?
user01@server01:~$ nvidia-smi
Wed Jan 15 08:01:23 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08             Driver Version: 550.127.08     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  |   00000000:07:00.0 Off |                    0 |
| N/A   32C    P0             53W /  400W |       4MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  |   00000000:0B:00.0 Off |                    0 |
| N/A   32C    P0             52W /  400W |       4MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  |   00000000:48:00.0 Off |                    0 |
| N/A   31C    P0             57W /  400W |       4MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  |   00000000:4C:00.0 Off |                    0 |
| N/A   32C    P0             53W /  400W |       4MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A100-SXM4-40GB          On  |   00000000:88:00.0 Off |                    0 |
| N/A   30C    P0             53W /  400W |       4MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A100-SXM4-40GB          On  |   00000000:8B:00.0 Off |                    0 |
| N/A   33C    P0             52W /  400W |       4MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A100-SXM4-40GB          On  |   00000000:C8:00.0 Off |                    0 |
| N/A   33C    P0             61W /  400W |       4MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A100-SXM4-40GB          On  |   00000000:CB:00.0 Off |                    0 |
| N/A   32C    P0             54W /  400W |       4MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
user01@server01:~$ dcgmi discovery -l
8 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA A100-SXM4-40GB                                          |
|        | PCI Bus ID: 00000000:07:00.0                                         |
|        | Device UUID: GPU-829f1fdb-d92c-f73f-b083-198c70f3768d                |
+--------+----------------------------------------------------------------------+
| 1      | Name: NVIDIA A100-SXM4-40GB                                          |
|        | PCI Bus ID: 00000000:0B:00.0                                         |
|        | Device UUID: GPU-b1127afc-0997-3f01-533f-a002c4c891dc                |
+--------+----------------------------------------------------------------------+
| 2      | Name: NVIDIA A100-SXM4-40GB                                          |
|        | PCI Bus ID: 00000000:48:00.0                                         |
|        | Device UUID: GPU-60f65417-5658-f28e-60a5-71ee572f2e05                |
+--------+----------------------------------------------------------------------+
| 3      | Name: NVIDIA A100-SXM4-40GB                                          |
|        | PCI Bus ID: 00000000:4C:00.0                                         |
|        | Device UUID: GPU-61d488cd-1fa4-b248-5add-58753793ef98                |
+--------+----------------------------------------------------------------------+
| 4      | Name: NVIDIA A100-SXM4-40GB                                          |
|        | PCI Bus ID: 00000000:88:00.0                                         |
|        | Device UUID: GPU-cd49d8da-8b49-db0f-31b1-9257e6ead69c                |
+--------+----------------------------------------------------------------------+
| 5      | Name: NVIDIA A100-SXM4-40GB                                          |
|        | PCI Bus ID: 00000000:8B:00.0                                         |
|        | Device UUID: GPU-e23a1b13-5fda-107b-db20-886c4d0b87f7                |
+--------+----------------------------------------------------------------------+
| 6      | Name: NVIDIA A100-SXM4-40GB                                          |
|        | PCI Bus ID: 00000000:C8:00.0                                         |
|        | Device UUID: GPU-d413b274-743f-ad68-dc3b-8e2442326c55                |
+--------+----------------------------------------------------------------------+
| 7      | Name: NVIDIA A100-SXM4-40GB                                          |
|        | PCI Bus ID: 00000000:CB:00.0                                         |
|        | Device UUID: GPU-655fa6ae-b10f-51a5-d39f-2f99d1f8116f                |
+--------+----------------------------------------------------------------------+
6 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
| 8         |
| 13        |
| 11        |
| 12        |
| 10        |
| 9         |
+-----------+
0 CPUs found.
+--------+----------------------------------------------------------------------+
| CPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
+--------+----------------------------------------------------------------------+
user01@server01:~$ nvidia-smi topo -m
        GPU0   GPU1   GPU2   GPU3   GPU4   GPU5   GPU6   GPU7   CPU Affinity   NUMA Affinity   GPU NUMA ID
GPU0     X     NV12   NV12   NV12   NV12   NV12   NV12   NV12   48-63          N/A             N/A
GPU1    NV12    X     NV12   NV12   NV12   NV12   NV12   NV12   48-63          N/A             N/A
GPU2    NV12   NV12    X     NV12   NV12   NV12   NV12   NV12   16-31          1               N/A
GPU3    NV12   NV12   NV12    X     NV12   NV12   NV12   NV12   16-31          1               N/A
GPU4    NV12   NV12   NV12   NV12    X     NV12   NV12   NV12   112-127        N/A             N/A
GPU5    NV12   NV12   NV12   NV12   NV12    X     NV12   NV12   112-127        N/A             N/A
GPU6    NV12   NV12   NV12   NV12   NV12   NV12    X     NV12   80-95          5               N/A
GPU7    NV12   NV12   NV12   NV12   NV12   NV12   NV12    X     80-95          5               N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
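In case it helps with diagnosis, I can also dump the per-link NVLink state with the stock nvidia-smi nvlink subcommand. This is just the check I run (guarded so the snippet degrades cleanly on a host without the NVIDIA tools installed):

```shell
# Dump per-link NVLink state for every GPU. Guarded so the snippet
# prints a fallback message instead of failing on hosts where
# nvidia-smi is not installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi nvlink --status
else
    echo "nvidia-smi not available on this host"
fi
```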
user01@server01:~$ dcgmi diag -r 1
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 4.0.0                                          |
| Driver Version Detected   | 550.127.08                                     |
| GPU Device IDs Detected   | 20b0, 20b0, 20b0, 20b0, 20b0, 20b0, 20b0, 20b0 |
|-----  Deployment  --------+------------------------------------------------|
| software                  | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU3: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
+---------------------------+------------------------------------------------+
user01@server01:~$ !853
nvidia-smi -q | grep -A2 -i fabric
    Fabric
        State  : N/A
        Status : N/A
    Fabric
        State  : N/A
        Status : N/A
--
    Fabric
        State  : N/A
        Status : N/A
    Fabric
        State  : N/A
        Status : N/A
--
    Fabric
        State  : N/A
        Status : N/A
    Fabric
        State  : N/A
        Status : N/A
--
    Fabric
        State  : N/A
        Status : N/A
    Fabric
        State  : N/A
        Status : N/A
user01@server01:~$ sudo systemctl status nvidia-fabricmanager
● nvidia-fabricmanager.service - NVIDIA fabric manager service
Loaded: loaded (/usr/lib/systemd/system/nvidia-fabricmanager.service; enabled; preset: enabled)
Active: active (running) since Tue 2025-01-14 15:07:58 CET; 16h ago
Process: 3493 ExecStart=/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg (code=exited, status=0/SUCCESS)
Main PID: 3512 (nv-fabricmanage)
Tasks: 19 (limit: 629145)
Memory: 26.8M (peak: 29.5M)
CPU: 23.566s
CGroup: /system.slice/nvidia-fabricmanager.service
└─3512 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg
Jan 14 15:07:45 server01 systemd[1]: Starting nvidia-fabricmanager.service - NVIDIA fabric manager service…
Jan 14 15:07:46 server01 nv-fabricmanager[3512]: Connected to 1 node.
Jan 14 15:07:58 server01 nv-fabricmanager[3512]: Successfully configured all the available GPUs and NVSwitches to route NVLink traffic.
Jan 14 15:07:58 server01 systemd[1]: Started nvidia-fabricmanager.service - NVIDIA fabric manager service.
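For reference, here is how I inspect the fabric manager's effective settings and recent service log lines. The config path is the one from the systemd unit above; everything else (e.g. a customised LOG_FILE_NAME inside fabricmanager.cfg) could move things around, so treat this as a sketch:

```shell
# Show the effective (non-comment, non-blank) fabric manager settings,
# then the last 20 service log lines. The config path matches the
# ExecStart line of the nvidia-fabricmanager systemd unit; both lookups
# are guarded so the snippet degrades cleanly off the target host.
CFG=/usr/share/nvidia/nvswitch/fabricmanager.cfg
if [ -r "$CFG" ]; then
    grep -Ev '^[[:space:]]*(#|$)' "$CFG"
else
    echo "fabricmanager.cfg not readable at $CFG"
fi
journalctl -u nvidia-fabricmanager --no-pager -n 20 2>/dev/null \
    || echo "journalctl not available"
```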