NCCL all_reduce_perf hangs with A100 SXM4 on AMD CPUs (driver 570.172.08 + CUDA 12.8) but works on driver 550.163.01

After updating to driver 570.172.08 and CUDA 12.8, all_reduce_perf from nccl-tests hangs on my AMD-based A100 SXM4 80GB server:

System setup

  • Server type: 4× NVIDIA A100-SXM4-80GB
    System Information
    Manufacturer: Dell Inc.
    Product Name: PowerEdge XE8545

  • CPU: AMD EPYC 7413 24-Core Processor

  • OS:
    PRETTY_NAME="Ubuntu 22.04.4 LTS"
    NAME="Ubuntu"
    VERSION_ID="22.04"
    VERSION="22.04.4 LTS (Jammy Jellyfish)"
    VERSION_CODENAME=jammy
    ID=ubuntu
    ID_LIKE=debian
    UBUNTU_CODENAME=jammy

    Kernel (uname -r): 6.8.0-79-generic

  • Driver / CUDA / NCCL versions tested:

    • Working: Driver 550.163.01, CUDA 12.4, NCCL 2.20.5+cuda12.4

    • Problematic: Driver 570.172.08, CUDA 12.8, NCCL 2.20.5+cuda12.4 / NCCL 2.25.1+cuda12.8

  • GPU topology (nvidia-smi topo -m):

            GPU0    GPU1    GPU2    GPU3    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
    GPU0     X      NV4     NV4     NV4     SYS     SYS     18-23,66-71     3               N/A
    GPU1    NV4      X      NV4     NV4     SYS     SYS     6-11,54-59      1               N/A
    GPU2    NV4     NV4      X      NV4     SYS     SYS     42-47,90-95     7               N/A
    GPU3    NV4     NV4     NV4      X      SYS     SYS     30-35,78-83     5               N/A
    NIC0    SYS     SYS     SYS     SYS      X      PIX
    NIC1    SYS     SYS     SYS     SYS     PIX      X 
    
    
  • nvidia-smi:

Issue

  • Running all_reduce_perf from nccl-tests hangs when using all 4 GPUs or any subset that includes GPU 3.

  • Example:

    • CUDA_VISIBLE_DEVICES=0,1,2 ./all_reduce_perf -g 3 → passes

    • CUDA_VISIBLE_DEVICES=0,1,3 ./all_reduce_perf -g 3 → hangs

    • CUDA_VISIBLE_DEVICES=0,3 ./all_reduce_perf -g 2 → hangs

      (Tested on multiple servers to rule out a faulty GPU 3 or its connections.)
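
For anyone who wants to reproduce the pattern above without babysitting each run, here is a minimal sketch of a sweep over GPU subsets with a timeout. It is not part of nccl-tests: the function names, the 120-second timeout, and treating a timeout as a "hang" are my own assumptions. Only the CUDA_VISIBLE_DEVICES and -g usage comes from the commands listed above.

```python
import itertools
import os
import subprocess


def gpu_subsets(num_gpus, min_size=2):
    """Yield every combination of GPU indices with at least min_size members."""
    for size in range(min_size, num_gpus + 1):
        yield from itertools.combinations(range(num_gpus), size)


def run_with_timeout(cmd, env=None, timeout_s=120):
    """Run cmd and classify the outcome as 'pass', 'fail', or 'hang'."""
    try:
        result = subprocess.run(
            cmd,
            env=env,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        # Assumption: a run that exceeds the timeout is counted as a hang.
        return "hang"
    return "pass" if result.returncode == 0 else "fail"


def sweep(binary, num_gpus=4, timeout_s=120):
    """Run the benchmark once per GPU subset and collect the outcomes."""
    outcomes = {}
    for subset in gpu_subsets(num_gpus):
        env = dict(os.environ)
        env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in subset)
        cmd = [binary, "-g", str(len(subset))]
        outcomes[subset] = run_with_timeout(cmd, env=env, timeout_s=timeout_s)
        print(f"CUDA_VISIBLE_DEVICES={env['CUDA_VISIBLE_DEVICES']}: {outcomes[subset]}")
    return outcomes
```

Calling sweep("./all_reduce_perf") would cover all 11 subsets of two or more GPUs on a 4-GPU box. If the pattern above holds, every subset containing GPU 3 should come back as "hang".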

Debugging attempts

  • Tried amd_iommu=off on the kernel boot line (I am running bare metal)

  • Disabled ACS on all PCIe devices using NCCL’s ACS troubleshooting script (setpci ECAP_ACS+0x6.w=0000).
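
To double-check that the ACS change actually took effect, the ACSCtl lines in the output of lspci -vvv (run as root) can be inspected. Below is a rough sketch of a parser for that output. The function name and the sample text are hypothetical and only illustrate the usual lspci layout; the sample is not taken from the affected server.

```python
import re

# A PCI device header starts at column 0 with an address such as
# "00:01.1" or "0000:00:01.1"; capability lines are indented.
DEVICE_RE = re.compile(
    r"^((?:[0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\s"
)


def acs_enabled_devices(lspci_text):
    """Return {address: [enabled flags]} for devices whose ACSCtl line has a '+' flag."""
    enabled = {}
    current = None
    for line in lspci_text.splitlines():
        match = DEVICE_RE.match(line)
        if match:
            current = match.group(1)
            continue
        stripped = line.strip()
        if current and stripped.startswith("ACSCtl:"):
            flags = stripped[len("ACSCtl:"):].split()
            on = [flag[:-1] for flag in flags if flag.endswith("+")]
            if on:
                enabled[current] = on
    return enabled


# Illustrative sample only (hypothetical devices, not real output from this system).
SAMPLE = """\
00:01.1 PCI bridge: Example bridge A
\t\tACSCap:\tSrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
\t\tACSCtl:\tSrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
00:03.1 PCI bridge: Example bridge B
\t\tACSCtl:\tSrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
"""

print(acs_enabled_devices(SAMPLE))
```

On a real system the text would come from something like subprocess.run(["lspci", "-vvv"], capture_output=True, text=True).stdout. An empty result would suggest that no device still has an ACS control bit set.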

Attached are debug logs from a run with:
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,GRAPH,COLL ./all_reduce_perf_amd -g 4

nccl_debug.txt (59.1 KB)

topo.txt (3.2 KB)
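
If someone wants to capture the same kind of logs on their own system, this sketch wraps the debug run above so that the output goes to files. NCCL_DEBUG_FILE and NCCL_TOPO_DUMP_FILE are documented NCCL environment variables; the helper names, file names, and timeout are assumptions of mine, and I am not claiming the attached files were produced this way.

```python
import os
import subprocess


def debug_env(log_path, topo_path, base_env=None):
    """Return a copy of the environment with NCCL debug logging redirected to files."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"
    env["NCCL_DEBUG_SUBSYS"] = "INIT,GRAPH,COLL"
    # %h and %p are expanded by NCCL to the hostname and process ID.
    env["NCCL_DEBUG_FILE"] = log_path
    # NCCL writes the detected topology as XML to this path.
    env["NCCL_TOPO_DUMP_FILE"] = topo_path
    return env


def run_debug(binary, num_gpus, log_path="nccl_debug.%h.%p.txt",
              topo_path="topo.xml", timeout_s=300):
    """Run the benchmark with debug logging; return the exit code, or None on timeout."""
    cmd = [binary, "-g", str(num_gpus)]
    try:
        result = subprocess.run(
            cmd, env=debug_env(log_path, topo_path), timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # Assumption: hitting the timeout means the run hung.
        return None
    return result.returncode
```

For example, run_debug("./all_reduce_perf", 4) would leave a per-process log next to the binary even when the run has to be killed after the timeout.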

Observations

  • The issue goes away if I downgrade the driver to 550.163.01 (with CUDA 12.4 + NCCL 2.20.5)

  • On a different server with 8× A100 SXM4 but Intel CPUs, the same driver 570.172.08 + CUDA 12.8 + NCCL 2.20.5 stack works fine with no hangs.

Let me know if any information is missing,
Thanks
