NCCL all_reduce_perf hangs with A100 SXM4 on AMD CPUs (driver 570.172.08 + CUDA 12.8) but works on driver 550.163.01

Mich96 · September 3, 2025, 3:11pm

After updating my A100 SXM4 80GB AMD server to driver 570.172.08 and CUDA 12.8, all_reduce_perf from nccl-tests hangs. Details below.

System setup

Server type: 4× NVIDIA A100-SXM4-80GB
System Information
Manufacturer: Dell Inc.
Product Name: PowerEdge XE8545
CPU: AMD EPYC 7413 24-Core Processor
OS:
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
UBUNTU_CODENAME=jammy

Kernel (uname -r): 6.8.0-79-generic
Driver / CUDA / NCCL versions tested:
- Working: Driver 550.163.01, CUDA 12.4, NCCL 2.20.5+cuda12.4
- Problematic: Driver 570.172.08, CUDA 12.8, NCCL 2.20.5+cuda12.4 / NCCL 2.25.1+cuda12.8

GPU topology (nvidia-smi topo -m):

        GPU0    GPU1    GPU2    GPU3    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV4     NV4     NV4     SYS     SYS     18-23,66-71     3               N/A
GPU1    NV4      X      NV4     NV4     SYS     SYS     6-11,54-59      1               N/A
GPU2    NV4     NV4      X      NV4     SYS     SYS     42-47,90-95     7               N/A
GPU3    NV4     NV4     NV4      X      SYS     SYS     30-35,78-83     5               N/A
NIC0    SYS     SYS     SYS     SYS      X      PIX
NIC1    SYS     SYS     SYS     SYS     PIX      X

nvidia-smi: [screenshot, 651×432]

Issue

Running all_reduce_perf from nccl-tests hangs when using all 4 GPUs or any subset that includes GPU 3.
Example:
- CUDA_VISIBLE_DEVICES=0,1,2 ./all_reduce_perf -g 3 → passes
- CUDA_VISIBLE_DEVICES=0,1,3 ./all_reduce_perf -g 3 → hangs
- CUDA_VISIBLE_DEVICES=0,3 ./all_reduce_perf -g 2 → hangs
  
  (Tested on multiple servers to rule out a faulty GPU 3 or faulty connections to it.)
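
A minimal sketch of how the subset tests above could be automated, so that a hung run is cut off and labeled instead of blocking the terminal. It assumes GNU coreutils `timeout` is available. The names `NCCL_TEST_BIN`, `HANG_TIMEOUT`, `gpu_pairs`, `run_pair` and `sweep_pairs` are hypothetical helpers made up for this illustration, not nccl-tests options, and the 120-second limit is an arbitrary choice.

```shell
# Hypothetical knobs for this sketch (not part of nccl-tests).
NCCL_TEST_BIN="${NCCL_TEST_BIN:-./all_reduce_perf}"
HANG_TIMEOUT="${HANG_TIMEOUT:-120}"   # seconds before a run is treated as hung

# Print every unordered pair of GPU indices for a GPU count, one "i,j" per line.
gpu_pairs() (
  n="$1"; i=0
  while [ "$i" -lt "$n" ]; do
    j=$((i + 1))
    while [ "$j" -lt "$n" ]; do
      echo "$i,$j"
      j=$((j + 1))
    done
    i=$((i + 1))
  done
)

# Run one pair under `timeout`. Exit status 124 means the time limit was hit;
# 137 means the process also ignored SIGTERM and was killed after 10 more seconds.
run_pair() (
  pair="$1"; rc=0
  CUDA_VISIBLE_DEVICES="$pair" timeout -k 10 "$HANG_TIMEOUT" \
    "$NCCL_TEST_BIN" -g 2 >/dev/null 2>&1 || rc=$?
  case "$rc" in
    0)       echo "$pair PASS" ;;
    124|137) echo "$pair HANG (no exit within ${HANG_TIMEOUT}s)" ;;
    *)       echo "$pair FAIL (exit code $rc)" ;;
  esac
)

# Test every pair on a machine with the given number of GPUs.
sweep_pairs() (
  for pair in $(gpu_pairs "$1"); do
    run_pair "$pair"
  done
)
```

Calling `sweep_pairs 4` after defining these functions would test all six pairs on this server. If the pattern described above holds, the pairs containing GPU 3 would be the ones reported as HANG.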

Debugging attempts

- Tried amd_iommu=off on the kernel boot line (I am running bare metal).
- Disabled ACS on all PCIe devices using NCCL’s ACS troubleshooting script (setpci ECAP_ACS+0x6.w=0000).
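
To confirm that both changes actually took effect, something like the following sketch could be used. The helper names `has_kernel_param` and `count_acs_enabled` are hypothetical. Treating a `SrcValid+` flag on an `ACSCtl:` line as "ACS still enabled" follows my understanding of the NCCL troubleshooting notes and is an assumption here.

```shell
# Succeed if kernel parameter $1 appears as a whole word in command-line text $2.
has_kernel_param() {
  case " $2 " in
    *" $1 "*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Read `lspci -vvv` output on stdin and print how many ACSCtl lines still
# show SrcValid+ (assumed to indicate that ACS is enabled on that device).
count_acs_enabled() {
  grep 'ACSCtl:' | grep -c 'SrcValid+'
}

# Example usage on the affected server (needs root for full lspci output):
#   has_kernel_param amd_iommu=off "$(cat /proc/cmdline)" && echo "IOMMU off"
#   sudo lspci -vvv | count_acs_enabled
```

A count of 0 from the second command would suggest that the setpci script reached every device. Any other number would point to bridges where ACS is still on.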

Attached are debug logs from a run with:
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,GRAPH,COLL ./all_reduce_perf_amd -g 4

nccl_debug.txt (59.1 KB)

topo.txt (3.2 KB)

Observations

- The issue goes away if I downgrade the driver to 550.163.01 (with CUDA 12.4 + NCCL 2.20.5).
- On a different server with 8× A100 SXM4 but Intel CPUs, the same driver 570.172.08 + CUDA 12.8 + NCCL 2.20.5 stack works fine with no hangs.

Let me know if any information is missing.
Thanks
