Fabricmanager: NVSwitches found. dcgmi: NVSwitches not found

I’ve set up CUDA with many different versions and on different hardware before, but this is my first time with eight H100 cards connected by NVSwitches.
I usually test with a simple TensorFlow program, but it isn’t detecting the cards.
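For reference, the test program is nothing fancy; it’s a minimal device check along these lines (a sketch rather than the exact script, using only stock TensorFlow API calls):

import tensorflow as tf

# List the GPUs TensorFlow can see; on this box I would expect eight entries.
gpus = tf.config.list_physical_devices("GPU")
print(f"TensorFlow {tf.__version__} sees {len(gpus)} GPU(s)")

# If any are visible, run a trivial op on the first one as a sanity check.
if gpus:
    with tf.device("/GPU:0"):
        x = tf.random.normal([1024, 1024])
        print("matmul OK:", tf.reduce_sum(tf.matmul(x, x)).numpy())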

nvidia-smi shows all eight cards.
dcgmi shows all eight cards, but no NVSwitches.
The fabricmanager log shows four NVSwitches.
Output from each of these follows below.

I’m guessing I’m missing an obvious step, but I’m not sure what it would be. I’ve heard that the H100 cards get exposed as individual cards when fabricmanager isn’t running, so I tried that as well.
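In case it matters, fabricmanager itself is definitely running as a service here. Assuming the standard NVIDIA packaging, the systemd unit is nvidia-fabricmanager, which is what I’m checking with:

systemctl status nvidia-fabricmanager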

nvidia-smi
Thu Oct 10 16:32:40 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          Off |   00000000:18:00.0 Off |                    0 |
| N/A   27C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          Off |   00000000:2A:00.0 Off |                    0 |
| N/A   28C    P0             68W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          Off |   00000000:3A:00.0 Off |                    0 |
| N/A   28C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          Off |   00000000:5D:00.0 Off |                    0 |
| N/A   25C    P0             68W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          Off |   00000000:9A:00.0 Off |                    0 |
| N/A   26C    P0             68W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          Off |   00000000:AB:00.0 Off |                    0 |
| N/A   27C    P0             68W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          Off |   00000000:BA:00.0 Off |                    0 |
| N/A   28C    P0             71W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          Off |   00000000:DB:00.0 Off |                    0 |
| N/A   26C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

dcgmi discovery -l
8 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                  |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA H100 80GB HBM3                                          |
|        | PCI Bus ID: 00000000:18:00.0                                         |
|        | Device UUID: [redacted]                                              |
+--------+----------------------------------------------------------------------+
| 1      | Name: NVIDIA H100 80GB HBM3                                          |
|        | PCI Bus ID: 00000000:2A:00.0                                         |
|        | Device UUID: [redacted]                                              |
+--------+----------------------------------------------------------------------+
| 2      | Name: NVIDIA H100 80GB HBM3                                          |
|        | PCI Bus ID: 00000000:3A:00.0                                         |
|        | Device UUID: [redacted]                                              |
+--------+----------------------------------------------------------------------+
| 3      | Name: NVIDIA H100 80GB HBM3                                          |
|        | PCI Bus ID: 00000000:5D:00.0                                         |
|        | Device UUID: [redacted]                                              |
+--------+----------------------------------------------------------------------+
| 4      | Name: NVIDIA H100 80GB HBM3                                          |
|        | PCI Bus ID: 00000000:9A:00.0                                         |
|        | Device UUID: [redacted]                                              |
+--------+----------------------------------------------------------------------+
| 5      | Name: NVIDIA H100 80GB HBM3                                          |
|        | PCI Bus ID: 00000000:AB:00.0                                         |
|        | Device UUID: [redacted]                                              |
+--------+----------------------------------------------------------------------+
| 6      | Name: NVIDIA H100 80GB HBM3                                          |
|        | PCI Bus ID: 00000000:BA:00.0                                         |
|        | Device UUID: [redacted]                                              |
+--------+----------------------------------------------------------------------+
| 7      | Name: NVIDIA H100 80GB HBM3                                          |
|        | PCI Bus ID: 00000000:DB:00.0                                         |
|        | Device UUID: [redacted]                                              |
+--------+----------------------------------------------------------------------+
0 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
+-----------+
0 CPUs found.
+--------+----------------------------------------------------------------------+
| CPU ID | Device Information                                                  |
+--------+----------------------------------------------------------------------+
+--------+----------------------------------------------------------------------+

grep NV /var/log/fabricmanager.log |tail -38
[Oct 07 2024 17:54:09] [INFO] [tid 84606] completed NVSwitch 0/3 routing configuration
[Oct 07 2024 17:54:09] [INFO] [tid 84606] Successfully configured all the available NVSwitches to route GPU NVLink traffic. NVLink Peer-to-Peer support will be enabled once the GPUs are successfully registered with the NVLink fabric.
[Oct 07 2024 17:54:14] [INFO] [tid 84624] NVLink inband GPU probe request received on Switch NodeId 0 Switch Id 4 port 34 from Compute NodeId 0 GPU Id 2 port 12.
[Oct 07 2024 17:54:14] [INFO] [tid 84624] added GPU with UUID [redacted] based on NVLink Inband GPU probe request.
[Oct 07 2024 17:54:14] [INFO] [tid 84624] NVLink inband GPU probe request received on Switch NodeId 0 Switch Id 3 port 3 from Compute NodeId 0 GPU Id 4 port 6.
[Oct 07 2024 17:54:14] [INFO] [tid 84624] added GPU with UUID [redacted] based on NVLink Inband GPU probe request.
[Oct 07 2024 17:54:15] [INFO] [tid 84624] NVLink inband GPU probe request received on Switch NodeId 0 Switch Id 2 port 60 from Compute NodeId 0 GPU Id 8 port 0.
[Oct 07 2024 17:54:15] [INFO] [tid 84624] added GPU with UUID [redacted] based on NVLink Inband GPU probe request.
[Oct 07 2024 17:54:15] [INFO] [tid 84624] NVLink inband GPU probe request received on Switch NodeId 0 Switch Id 3 port 56 from Compute NodeId 0 GPU Id 7 port 8.
[Oct 07 2024 17:54:15] [INFO] [tid 84624] added GPU with UUID [redacted] based on NVLink Inband GPU probe request.
[Oct 07 2024 17:54:16] [INFO] [tid 84624] NVLink inband GPU probe request received on Switch NodeId 0 Switch Id 1 port 62 from Compute NodeId 0 GPU Id 5 port 12.
[Oct 07 2024 17:54:16] [INFO] [tid 84624] added GPU with UUID [redacted] based on NVLink Inband GPU probe request.
[Oct 07 2024 17:54:16] [INFO] [tid 84624] NVLink inband GPU probe request received on Switch NodeId 0 Switch Id 3 port 0 from Compute NodeId 0 GPU Id 6 port 4.
[Oct 07 2024 17:54:16] [INFO] [tid 84624] added GPU with UUID [redacted] based on NVLink Inband GPU probe request.
[Oct 07 2024 17:54:17] [INFO] [tid 84624] NVLink inband GPU probe request received on Switch NodeId 0 Switch Id 2 port 1 from Compute NodeId 0 GPU Id 3 port 16.
[Oct 07 2024 17:54:17] [INFO] [tid 84624] added GPU with UUID [redacted] based on NVLink Inband GPU probe request.
[Oct 07 2024 17:54:17] [INFO] [tid 84624] NVLink inband GPU probe request received on Switch NodeId 0 Switch Id 1 port 44 from Compute NodeId 0 GPU Id 1 port 12.
[Oct 07 2024 17:54:17] [INFO] [tid 84624] added GPU with UUID [redacted] based on NVLink Inband GPU probe request.
[Oct 07 2024 18:10:13] [INFO] [tid 88075] Option when facing GPU to NVSwitch NVLink failure = 0
[Oct 07 2024 18:10:13] [INFO] [tid 88075] Option when facing NVSwitch to NVSwitch NVLink failure = 0
[Oct 07 2024 18:10:13] [INFO] [tid 88075] Option when facing NVSwitch failure = 0
[Oct 07 2024 18:10:13] [INFO] [tid 88075] NVLink Domain ID : NDI-[redacted]
[Oct 07 2024 18:10:14] [INFO] [tid 88075] getting NVSwitch device information
[Oct 07 2024 18:10:14] [INFO] [tid 88075] number of devices specified in topology file NVSwitches: 4, GPUs: 8
[Oct 07 2024 18:10:14] [INFO] [tid 88075] getting NVSwitch device information
[Oct 07 2024 18:10:14] [INFO] [tid 88075] dumping all the detected NVSwitch information for Switch NodeId 0
[Oct 07 2024 18:10:14] [INFO] [tid 88075] getting NVLink device information
[Oct 07 2024 18:10:14] [INFO] [tid 88075] NVLink Inband feature is enabled. Hence Fabric Manager is not opening and operating on GPUs directly.
[Oct 07 2024 18:10:14] [INFO] [tid 88075] NVLink Autonomous Link Initialization (ALI) feature is enabled.
[Oct 07 2024 18:10:14] [INFO] [tid 88075] start NVSwitch 0/0 routing configuration
[Oct 07 2024 18:10:14] [INFO] [tid 88075] completed NVSwitch 0/0 routing configuration
[Oct 07 2024 18:10:14] [INFO] [tid 88075] start NVSwitch 0/1 routing configuration
[Oct 07 2024 18:10:14] [INFO] [tid 88075] completed NVSwitch 0/1 routing configuration
[Oct 07 2024 18:10:14] [INFO] [tid 88075] start NVSwitch 0/2 routing configuration
[Oct 07 2024 18:10:14] [INFO] [tid 88075] completed NVSwitch 0/2 routing configuration
[Oct 07 2024 18:10:14] [INFO] [tid 88075] start NVSwitch 0/3 routing configuration
[Oct 07 2024 18:10:14] [INFO] [tid 88075] completed NVSwitch 0/3 routing configuration
[Oct 07 2024 18:10:15] [INFO] [tid 88075] Successfully configured all the available NVSwitches to route GPU NVLink traffic. NVLink Peer-to-Peer support will be enabled once the GPUs are successfully registered with the NVLink fabric.

nvidia-smi -q |grep -A2 -i fabric
Fabric
    State                             : Completed
    Status                            : Success

Fabric
    State                             : Completed
    Status                            : Success

Fabric
    State                             : Completed
    Status                            : Success

Fabric
    State                             : Completed
    Status                            : Success

Fabric
    State                             : Completed
    Status                            : Success

Fabric
    State                             : Completed
    Status                            : Success

Fabric
    State                             : Completed
    Status                            : Success

Fabric
    State                             : Completed
    Status                            : Success

I did miss a step: installing the NSCQ library, per the Getting Started section of the NVIDIA DCGM documentation.
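In case it helps someone else: with Debian/Ubuntu-style packaging (an assumption on my part; package and service names differ elsewhere), that step came down to installing the NSCQ library matching the driver branch and restarting the DCGM host engine, roughly:

sudo apt-get install libnvidia-nscq-560
sudo systemctl restart nvidia-dcgm
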
dcgmi discovery -l
8 GPUs found.
.
.
.
4 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
| 2         |
| 1         |
| 3         |
| 0         |
+-----------+

Still unsure how to test the GPUs, as TensorFlow still isn’t finding them. I’m guessing there’s an extra package or import required to run on HGX systems?
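One thing I still need to rule out is whether the installed TensorFlow wheel was even built with CUDA support, since a CPU-only build would report zero GPUs no matter what the NVLink fabric is doing. A quick check, using only standard TensorFlow APIs (nothing specific to this machine):

import tensorflow as tf

print("TF version:      ", tf.__version__)
print("Built with CUDA: ", tf.test.is_built_with_cuda())
# Build metadata includes the CUDA/cuDNN versions for CUDA-enabled wheels.
print("Build info:      ", tf.sysconfig.get_build_info())
print("Visible GPUs:    ", tf.config.list_physical_devices("GPU"))

If that reports a CUDA-enabled build and still no GPUs, the usual suspects are a restrictive CUDA_VISIBLE_DEVICES setting or a CUDA runtime/driver mismatch rather than anything NVSwitch-specific.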