Problem with Configuring Two Tesla H100 PCIe Cards via NVLink on Supermicro Platform

h100-nvlink-report.txt (7.1 KB)
nvidia-bug-report-18-10_01222025.log (6.0 MB)
Hello,

I’m encountering an issue while attempting to configure two Tesla H100 PCIe GPUs through NVLink on a Supermicro platform. Here’s a summary of the setup and the problem:

System Details:

  • OS: Linux 6.8.0-51-generic #52-Ubuntu SMP PREEMPT_DYNAMIC Thu Dec 5 13:09:44 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
  • Motherboard: Supermicro X13DEG-QT
  • Chassis: Supermicro CSE-749TS-R2K05BP
  • BIOS Version: 2.4 (08/23/2024)
  • GPUs: 2 x NVIDIA Tesla H100 PCIe
  • Driver Version: 550.144.03
  • CUDA Version: 12.4
  • UEFI Settings:
    • “Above 4G Decoding” is enabled.
    • “Resizable BAR” is enabled.

Problem:

NVLink does not establish between the two H100 cards, even though NVLink support is explicitly stated in the product documentation.

Observations:

  1. dmesg shows the NVLink core initializing:

    nvidia-nvlink: Nvlink Core is being initialized, major device number 505
    

    However, no further NVLink-related messages appear.

  2. Output from nvidia-smi topo -m indicates no NVLink connection:

    GPU0  GPU1  SYS  NODE
    GPU0   X    NODE
    GPU1  NODE   X
    
  3. nvidia-smi recognizes both cards correctly, but shows no MIG instances and no active NVLink:

    +-----------------------------------------------------------------------------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    |=========================================+========================+======================|
    |   0  NVIDIA H100 PCIe               Off |   00000000:2A:00.0 Off |                    0 |
    |   1  NVIDIA H100 PCIe               Off |   00000000:3D:00.0 Off |                    0 |
    +-----------------------------------------------------------------------------------------+
    
  4. The dmesg log for both GPUs contains DOE (Data Object Exchange) timeout errors:

    pci 0000:2a:00.0: DOE: [2c8] ABORT timed out
    pci 0000:3d:00.0: DOE: [2c8] ABORT timed out
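
For anyone who wants to script the check from observation 2, here is a minimal Python sketch that scans the text of `nvidia-smi topo -m` for `NV<n>` cells, which is how the matrix marks an NVLink connection. The function name `nvlink_pairs` is made up for illustration, and the parsing assumes the GPU columns come first in the header and that the text contains no terminal colour codes; treat it as a rough helper, not an official tool.

```python
import re

def nvlink_pairs(topo_text):
    """Return (row_gpu, col_gpu, link) tuples for cells marked NV<n>.

    Assumption: the first non-empty line is the header and the GPU
    columns (GPU0, GPU1, ...) are listed before any other columns.
    """
    lines = [ln for ln in topo_text.splitlines() if ln.strip()]
    if not lines:
        return []
    # Keep only header tokens that look like GPU column names.
    gpu_cols = [tok for tok in lines[0].split() if re.fullmatch(r"GPU\d+", tok)]
    pairs = []
    for ln in lines[1:]:
        cells = ln.split()
        # Skip legend lines and any row that is not a GPU row.
        if not cells or not re.fullmatch(r"GPU\d+", cells[0]):
            continue
        for col, cell in zip(gpu_cols, cells[1:]):
            if re.fullmatch(r"NV\d+", cell):
                pairs.append((cells[0], col, cell))
    return pairs

# Matrix as posted above: only NODE entries, so no NVLink pairs are found.
posted = """GPU0  GPU1  SYS  NODE
GPU0   X    NODE
GPU1  NODE   X
"""
print(nvlink_pairs(posted))
```

An empty list matches what is described here: the two GPUs only see each other over PCIe within the same NUMA node. To run it against a live system, the text could be taken from `subprocess.run(["nvidia-smi", "topo", "-m"], capture_output=True, text=True).stdout`.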
    

Steps Taken:

  • Verified BIOS settings for “Above 4G Decoding” and “Resizable BAR.”
  • Ensured both cards are seated properly and have adequate power.
  • Confirmed that the driver and CUDA versions are compatible with the H100 cards.

Request:

Could anyone provide insights or suggestions on troubleshooting this issue? Is there a specific NVLink configuration step I’m missing for Tesla H100 PCIe cards on this platform?

Thanks in advance for your help!

Do you have all three NVLink bridges connected?
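
If it helps, the per-link state can also be checked from software with `nvidia-smi nvlink --status`. Below is a rough Python sketch that counts the links reported as active per GPU. The helper name `active_links` is hypothetical, and the line layout it expects (`GPU <n>: ...` followed by `Link <n>: <speed>` or `Link <n>: <inactive>`) is an assumption about the output format that may differ between driver versions. The sample text is invented for illustration and is not taken from this system.

```python
import re

def active_links(status_text):
    """Count links per GPU that are not reported as inactive.

    Assumed layout (may vary by driver version):
        GPU 0: <name> (UUID: ...)
             Link 0: 26.562 GB/s
             Link 1: <inactive>
    """
    counts = {}
    gpu = None
    for line in status_text.splitlines():
        m = re.match(r"\s*GPU (\d+):", line)
        if m:
            gpu = int(m.group(1))
            counts[gpu] = 0
            continue
        m = re.match(r"\s*Link \d+:\s*(.+)", line)
        if m and gpu is not None and "inactive" not in m.group(1).lower():
            counts[gpu] += 1
    return counts

# Invented sample: GPU 0 has one trained link, GPU 1 has none.
sample = """GPU 0: NVIDIA H100 PCIe (UUID: GPU-xxxx)
\t Link 0: 26.562 GB/s
\t Link 1: <inactive>
GPU 1: NVIDIA H100 PCIe (UUID: GPU-yyyy)
\t Link 0: <inactive>
"""
print(active_links(sample))
```

A count of zero on both GPUs would suggest that no link has trained at all, which points towards the bridges or their seating rather than a driver setting.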