Problem with Configuring Two Tesla H100 PCIe Cards via NVLink on Supermicro Platform

arman.airapetov · January 23, 2025, 12:58pm

h100-nvlink-report.txt (7.1 KB)
nvidia-bug-report-18-10_01222025.log (6.0 MB)
Hello,

I’m encountering an issue while attempting to configure two Tesla H100 PCIe GPUs through NVLink on a Supermicro platform. Here’s a summary of the setup and the problem:

System Details:

OS: Linux 6.8.0-51-generic #52-Ubuntu SMP PREEMPT_DYNAMIC Thu Dec 5 13:09:44 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Motherboard: Supermicro X13DEG-QT
Chassis: Supermicro CSE-749TS-R2K05BP
BIOS Version: 2.4 (08/23/2024)
GPUs: 2 x NVIDIA Tesla H100 PCIe
Driver Version: 550.144.03
CUDA Version: 12.4
UEFI Settings:
- “Above 4G Decoding” is enabled.
- “Resizable BAR” is enabled.

Problem:

NVLink does not establish between the two H100 cards, even though NVLink support is explicitly stated in the product documentation.

Observations:

dmesg shows the NVLink core initializing:
```
nvidia-nvlink: Nvlink Core is being initialized, major device number 505
```
However, no further NVLink-related messages appear.
Output from nvidia-smi topo -m indicates no NVLink connection:
```
GPU0  GPU1  SYS  NODE
GPU0   X    NODE
GPU1  NODE   X
```
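For checking this programmatically, here is a minimal sketch that scans the text of `nvidia-smi topo -m` for `NV#` entries, which the tool's legend uses for GPU pairs connected over NVLink. The function name `nvlink_pairs` and the `linked` sample are hypothetical, and the parser assumes a whitespace-separated matrix whose header row names the GPUs.

```python
import re

def nvlink_pairs(topo_text):
    """Return (row_gpu, col_gpu, link) for every NV# cell in a topo matrix."""
    rows = [ln.split() for ln in topo_text.splitlines() if ln.strip()]
    if not rows:
        return []
    # GPU column names, in the order they appear in the header row.
    header = [tok for tok in rows[0] if re.fullmatch(r"GPU\d+", tok)]
    pairs = []
    for row in rows[1:]:
        if not re.fullmatch(r"GPU\d+", row[0]):
            continue  # skip legend lines and non-GPU rows (e.g. NICs)
        # Only the first len(header) cells belong to the GPU-to-GPU matrix.
        for col_gpu, cell in zip(header, row[1:]):
            if re.fullmatch(r"NV\d+", cell):
                pairs.append((row[0], col_gpu, cell))
    return pairs

# Matrix as posted above: only NODE entries, so no NVLink pairs are found.
posted = """GPU0  GPU1  SYS  NODE
GPU0   X    NODE
GPU1  NODE   X
"""
print(nvlink_pairs(posted))

# Hypothetical matrix showing what a bridged pair would look like.
linked = """GPU0  GPU1
GPU0   X    NV12
GPU1  NV12   X
"""
print(nvlink_pairs(linked))
```

An empty result for the posted matrix matches the observation that the two cards only see each other through the PCIe/NUMA path.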

The cards are recognized correctly by nvidia-smi but no MIG instances or active NVLink are shown:

```
+-----------------------------------------------------------------------------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
|=========================================+========================+======================|
|   0  NVIDIA H100 PCIe               Off |   00000000:2A:00.0 Off |                    0 |
|   1  NVIDIA H100 PCIe               Off |   00000000:3D:00.0 Off |                    0 |
+-----------------------------------------------------------------------------------------+
```

The dmesg log for both GPUs contains DOE (Data Object Exchange) timeout errors:

```
pci 0000:2a:00.0: DOE: [2c8] ABORT timed out
pci 0000:3d:00.0: DOE: [2c8] ABORT timed out
```
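To confirm that these timeouts come from the two GPUs and not from other devices, the PCI addresses can be pulled out of the log and compared with the Bus-Id column of `nvidia-smi`. The following is a rough sketch; the helper name `doe_timeouts` is hypothetical, and the pattern assumes the exact message format shown above.

```python
import re

# Matches lines such as: pci 0000:2a:00.0: DOE: [2c8] ABORT timed out
DOE_RE = re.compile(r"pci (\S+): DOE: \[(\w+)\] ABORT timed out")

def doe_timeouts(dmesg_text):
    """Return (pci_address, capability_offset) for each DOE abort timeout."""
    matches = (DOE_RE.search(line) for line in dmesg_text.splitlines())
    return [m.groups() for m in matches if m]

log = """pci 0000:2a:00.0: DOE: [2c8] ABORT timed out
pci 0000:3d:00.0: DOE: [2c8] ABORT timed out
"""
for address, offset in doe_timeouts(log):
    print(address, offset)
```

The addresses `2a:00.0` and `3d:00.0` correspond to the Bus-Id values `2A:00.0` and `3D:00.0` reported by `nvidia-smi` above.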

Steps Taken:

Verified BIOS settings for “Above 4G Decoding” and “Resizable BAR.”
Ensured both cards are seated properly and have adequate power.
Confirmed that the driver and CUDA versions are compatible with the H100 cards.

Request:

Could anyone provide insights or suggestions on troubleshooting this issue? Is there a specific NVLink configuration step I’m missing for Tesla H100 PCIe cards on this platform?

Thanks in advance for your help!

rs277 · January 23, 2025, 6:41pm

Do you have all three NVLink bridges connected?
