Issues with NVML: Unable to retrieve NVLink information as all links are inActive

We have a server with 2 A100 GPUs. It is running Ubuntu 22.04 LTS. Unfortunately, we are having issues with NVML: "Unable to retrieve NVLink information as all links are inActive".

nvidia-smi

Mon Dec 2 11:52:18 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000000:9B:00.0 Off | 0 |
| N/A 39C P0 63W / 300W | 77468MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80GB PCIe Off | 00000000:C8:00.0 Off | 0 |
| N/A 62C P0 81W / 300W | 23644MiB / 81920MiB | 3% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1456626 C …ng1/miniconda3/envs/bert/bin/python 532MiB |
| 0 N/A N/A 2216806 C …ng1/miniconda3/envs/bert/bin/python 534MiB |
| 0 N/A N/A 4077164 C …wn.ma/.conda/envs/myenv3/bin/python 76374MiB |
| 1 N/A N/A 2828291 C python 23632MiB |
+---------------------------------------------------------------------------------------+
(base) root@r940-01:/etc/modprobe.d# nvidia-smi nvlink --status
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-1dd0e73c-c1ea-89ea-0ce7-3fd7ca9e743e)
NVML: Unable to retrieve NVLink information as all links are inActive
GPU 1: NVIDIA A100 80GB PCIe (UUID: GPU-3d1d0a11-2cc2-4ec8-212d-3e41ab2777e7)
NVML: Unable to retrieve NVLink information as all links are inActive

Please kindly advise on how we can enable these.

Have you confirmed all 3 NVLink bridges are correctly seated? See Figure 4.
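
After reseating the bridges, it may help to re-run `nvidia-smi nvlink --status` and check the result per GPU. Below is a minimal Python sketch that parses that output and flags any GPU still reporting the "all links are inActive" message. The function name `parse_nvlink_status` is hypothetical, and the `Link N:` pattern for active links is an assumption about how active links are printed; only the inactive case is confirmed by the output posted in the question.

```python
import re

# Per-GPU header line, in the form shown in the question's output.
GPU_HEADER = re.compile(r"^GPU (\d+): (.+?) \(UUID: (GPU-[0-9A-Fa-f-]+)\)")
# Assumption: active links appear on lines such as "Link 0: <speed>".
LINK_LINE = re.compile(r"^\s*Link (\d+):")
# Message NVML prints when no link on the GPU is active.
INACTIVE_MSG = "all links are inActive"


def parse_nvlink_status(text):
    """Map GPU index -> dict with name, uuid, all_inactive flag, link ids."""
    gpus = {}
    current = None
    for line in text.splitlines():
        header = GPU_HEADER.match(line.strip())
        if header:
            current = int(header.group(1))
            gpus[current] = {
                "name": header.group(2),
                "uuid": header.group(3),
                "all_inactive": False,
                "links": [],
            }
            continue
        if current is None:
            continue  # ignore anything printed before the first GPU header
        if INACTIVE_MSG in line:
            gpus[current]["all_inactive"] = True
            continue
        link = LINK_LINE.match(line)
        if link:
            gpus[current]["links"].append(int(link.group(1)))
    return gpus


# Sample input: the output posted in the question.
SAMPLE = """\
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-1dd0e73c-c1ea-89ea-0ce7-3fd7ca9e743e)
NVML: Unable to retrieve NVLink information as all links are inActive
GPU 1: NVIDIA A100 80GB PCIe (UUID: GPU-3d1d0a11-2cc2-4ec8-212d-3e41ab2777e7)
NVML: Unable to retrieve NVLink information as all links are inActive
"""

for index, info in parse_nvlink_status(SAMPLE).items():
    if info["all_inactive"]:
        state = "no active NVLink links"
    else:
        state = f"{len(info['links'])} link line(s) reported"
    print(f"GPU {index} ({info['name']}): {state}")
```

To run it against the live system, the text could be captured with `subprocess.run(["nvidia-smi", "nvlink", "--status"], capture_output=True, text=True).stdout` in place of `SAMPLE`. As a complementary check, `nvidia-smi topo -m` prints the topology matrix; an `NV#` entry between GPU0 and GPU1 would indicate that the driver sees an NVLink connection between them.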