h100-nvlink-report.txt (7.1 KB)
nvidia-bug-report-18-10_01222025.log (6.0 MB)
Hello,
I’m encountering an issue while attempting to connect two NVIDIA H100 PCIe GPUs via NVLink on a Supermicro platform. Here’s a summary of the setup and the problem:
System Details:
- OS: Linux 6.8.0-51-generic #52-Ubuntu SMP PREEMPT_DYNAMIC Thu Dec 5 13:09:44 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
- Motherboard: Supermicro X13DEG-QT
- Chassis: Supermicro CSE-749TS-R2K05BP
- BIOS Version: 2.4 (08/23/2024)
- GPUs: 2 x NVIDIA H100 PCIe
- Driver Version: 550.144.03
- CUDA Version: 12.4
- UEFI Settings:
  - “Above 4G Decoding” is enabled.
  - “Resizable BAR” is enabled.
Problem:
An NVLink connection is not established between the two H100 cards, even though the product documentation explicitly states NVLink support.
Observations:
- dmesg shows the NVLink core initializing:
  nvidia-nvlink: Nvlink Core is being initialized, major device number 505
  However, no further NVLink-related messages appear.
- Output from nvidia-smi topo -m indicates no NVLink connection:
  GPU0 GPU1 SYS NODE GPU0 X NODE GPU1 NODE X
- The cards are recognized correctly by nvidia-smi, but no MIG instances or active NVLink are shown:
  +-----------------------------------------------------------------------------------------+
  | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
  |=========================================+========================+======================|
  |   0  NVIDIA H100 PCIe               Off |   00000000:2A:00.0 Off |                    0 |
  |   1  NVIDIA H100 PCIe               Off |   00000000:3D:00.0 Off |                    0 |
  +-----------------------------------------------------------------------------------------+
- The dmesg log for both GPUs contains DOE (Data Object Exchange) timeout errors:
  pci 0000:2a:00.0: DOE: [2c8] ABORT timed out
  pci 0000:3d:00.0: DOE: [2c8] ABORT timed out
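For anyone who wants to check the same two symptoms on their own machine, here is a minimal Python sketch that scans saved text from nvidia-smi topo -m and dmesg. The function names has_nvlink and doe_timeout_devices are hypothetical helpers, not part of any NVIDIA tool. The sketch assumes NVLink connections appear in the topology matrix as NV followed by a number (as described in the legend nvidia-smi prints), and that DOE timeouts follow the exact message format quoted above.

```python
import re

# Topology cells such as NV1, NV4 or NV12 denote NVLink connections.
NVLINK_CELL = re.compile(r"^NV\d+$")
# Kernel lines of the form: pci 0000:2a:00.0: DOE: [2c8] ABORT timed out
DOE_TIMEOUT = re.compile(r"pci (\S+): DOE: \[[0-9a-f]+\] ABORT timed out")


def has_nvlink(topo_text):
    """Return True if any GPU row of the topology matrix contains an NV# cell."""
    for line in topo_text.splitlines():
        cells = line.split()
        # Only rows that begin with a GPU label (GPU0, GPU1, ...) are inspected.
        if cells and re.match(r"^GPU\d+$", cells[0]):
            if any(NVLINK_CELL.match(cell) for cell in cells[1:]):
                return True
    return False


def doe_timeout_devices(dmesg_text):
    """Return the sorted PCI addresses that logged a DOE ABORT timeout."""
    return sorted(set(DOE_TIMEOUT.findall(dmesg_text)))


# Rows and log lines taken from the observations in this post.
topo_rows = "GPU0 X NODE\nGPU1 NODE X"
dmesg_lines = (
    "pci 0000:2a:00.0: DOE: [2c8] ABORT timed out\n"
    "pci 0000:3d:00.0: DOE: [2c8] ABORT timed out"
)
print(has_nvlink(topo_rows))             # False
print(doe_timeout_devices(dmesg_lines))  # ['0000:2a:00.0', '0000:3d:00.0']
```

The parsing is deliberately loose (whitespace-split cells, regex search) so it tolerates dmesg timestamps and the extra affinity columns nvidia-smi prints.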
Steps Taken:
- Verified BIOS settings for “Above 4G Decoding” and “Resizable BAR.”
- Ensured both cards are seated properly and have adequate power.
- Confirmed that the driver and CUDA versions are compatible with the H100 cards.
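In case it helps others reproduce these checks, below is a rough Python sketch for capturing the relevant nvidia-smi output in one go. The helper names run_diag and collect_diagnostics are hypothetical. The two commands listed are standard nvidia-smi subcommands; nvidia-smi nvlink --status is not among the outputs quoted in this post and is included on the assumption that per-link state is useful to attach.

```python
import shutil
import subprocess

# Commands whose output is worth attaching to a report.
DIAG_COMMANDS = [
    ["nvidia-smi", "topo", "-m"],
    ["nvidia-smi", "nvlink", "--status"],
]


def run_diag(argv, timeout=30):
    """Run one command and return its output, or a note if the tool is missing."""
    if shutil.which(argv[0]) is None:
        return f"not found: {argv[0]}"
    result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    # stderr is appended so driver or permission errors are not lost.
    return result.stdout + result.stderr


def collect_diagnostics(commands=DIAG_COMMANDS):
    """Map each command line (as a string) to the text it produced."""
    return {" ".join(argv): run_diag(argv) for argv in commands}
```

Calling collect_diagnostics() on the affected host returns a dictionary that can be written to a file and attached alongside the bug report.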
Request:
Could anyone provide insights or suggestions for troubleshooting this issue? Is there a specific NVLink configuration step I’m missing for NVIDIA H100 PCIe cards on this platform?
Thanks in advance for your help!