I’ve run into an issue where nccl-tests all_reduce_perf hangs on my A100 SXM4 80GB AMD server after updating the driver to 570.172.08 (CUDA 12.8).
System setup
- Server type: 4× NVIDIA A100-SXM4-80GB
- System Information:
  Manufacturer: Dell Inc.
  Product Name: PowerEdge XE8545
- CPU: AMD EPYC 7413 24-Core Processor
- OS:
  PRETTY_NAME="Ubuntu 22.04.4 LTS"
  NAME="Ubuntu"
  VERSION_ID="22.04"
  VERSION="22.04.4 LTS (Jammy Jellyfish)"
  VERSION_CODENAME=jammy
  ID=ubuntu
  ID_LIKE=debian
  UBUNTU_CODENAME=jammy
- Kernel (uname -r): 6.8.0-79-generic
- Driver / CUDA / NCCL versions tested:
  - Working: Driver 550.163.01, CUDA 12.4, NCCL 2.20.5+cuda12.4
  - Problematic: Driver 570.172.08, CUDA 12.8, NCCL 2.20.5+cuda12.4 or NCCL 2.25.1+cuda12.8
- GPU topology (nvidia-smi topo -m):

        GPU0  GPU1  GPU2  GPU3  NIC0  NIC1  CPU Affinity  NUMA Affinity  GPU NUMA ID
  GPU0  X     NV4   NV4   NV4   SYS   SYS   18-23,66-71   3              N/A
  GPU1  NV4   X     NV4   NV4   SYS   SYS   6-11,54-59    1              N/A
  GPU2  NV4   NV4   X     NV4   SYS   SYS   42-47,90-95   7              N/A
  GPU3  NV4   NV4   NV4   X     SYS   SYS   30-35,78-83   5              N/A
  NIC0  SYS   SYS   SYS   SYS   X     PIX
  NIC1  SYS   SYS   SYS   SYS   PIX   X
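As a quick sanity check of the matrix above, the GPU-to-GPU link types can be extracted so that any pair not reporting NV4 stands out. This is only a sketch: `gpu_links` is a hypothetical helper name, and it assumes the header row lists the GPU columns first and in index order, as in the output shown here. The `sed` step strips terminal color codes in case the header is printed with them (GNU sed syntax assumed).

```shell
#!/bin/sh
# Sketch: list every GPU-to-GPU link type from `nvidia-smi topo -m` output.
# gpu_links is a hypothetical helper, not part of any NVIDIA tool.

# Reads the topology matrix on stdin, prints "GPUa GPUb LINK" per ordered pair.
gpu_links() {
  sed 's/\x1b\[[0-9;]*m//g' | awk '
    !ngpu {   # still looking for the header row: count the GPUn columns
      for (i = 1; i <= NF; i++) if ($i ~ /^GPU[0-9]+$/) ngpu++
      next
    }
    $1 ~ /^GPU[0-9]+$/ {   # one matrix row per GPU; skip the diagonal "X"
      for (c = 1; c <= ngpu; c++)
        if ($(c + 1) != "X") print $1, "GPU" (c - 1), $(c + 1)
    }'
}

# Usage on a machine with the driver installed:
#   nvidia-smi topo -m | gpu_links
#   nvidia-smi topo -m | gpu_links | awk '$3 != "NV4"'   # prints nothing if all links are NV4
```

For the matrix in this post, every GPU pair, including those involving GPU 3, reports NV4, so the topology table itself shows no asymmetry.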
Issue
- Running all_reduce_perf from nccl-tests hangs when using all 4 GPUs or any subset that includes GPU 3.
- Example:
  - CUDA_VISIBLE_DEVICES=0,1,2 ./all_reduce_perf -g 3 → passes
  - CUDA_VISIBLE_DEVICES=0,1,3 ./all_reduce_perf -g 3 → hangs
  - CUDA_VISIBLE_DEVICES=0,3 ./all_reduce_perf -g 2 → hangs
- I tested on multiple servers to rule out a faulty GPU 3 or its connections.
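The subset testing above could be automated along these lines, so that a hang is reported instead of blocking the terminal. This is a sketch under assumptions: the binary path, the 120-second limit, and the message sizes are arbitrary choices, and `gpu_pairs` and `run_subset` are hypothetical helper names. It relies on GNU coreutils `timeout`, which exits with 124 when the time limit is hit (or 137 if the follow-up KILL was needed).

```shell
#!/bin/sh
# Sketch: run all_reduce_perf on every GPU pair under a time limit.
NCCL_TEST_BIN="${NCCL_TEST_BIN:-./all_reduce_perf}"  # assumed path to the nccl-tests binary
TIMEOUT_SECS="${TIMEOUT_SECS:-120}"                  # arbitrary limit, adjust as needed
NUM_GPUS="${NUM_GPUS:-4}"

# Print every unordered pair "i,j" of GPU indices below $1.
gpu_pairs() {
  for i in $(seq 0 $(( $1 - 2 ))); do
    for j in $(seq $(( i + 1 )) $(( $1 - 1 ))); do
      echo "$i,$j"
    done
  done
}

# Run the test on one comma-separated GPU subset and classify the exit status.
run_subset() {
  devs=$1
  ngpus=$(echo "$devs" | awk -F, '{print NF}')
  rc=0
  # -k 10: send SIGKILL if the process ignores SIGTERM for 10 more seconds.
  CUDA_VISIBLE_DEVICES="$devs" timeout -k 10 "$TIMEOUT_SECS" \
    "$NCCL_TEST_BIN" -b 8 -e 128M -f 2 -g "$ngpus" >/dev/null 2>&1 || rc=$?
  case $rc in
    0)       echo "$devs: pass" ;;
    124|137) echo "$devs: HANG (timed out after ${TIMEOUT_SECS}s)" ;;
    *)       echo "$devs: fail (exit $rc)" ;;
  esac
}

# Only run the sweep when the binary is actually present.
if [ -x "$NCCL_TEST_BIN" ]; then
  for pair in $(gpu_pairs "$NUM_GPUS"); do
    run_subset "$pair"
  done
fi
```

If the pattern described above holds, the pairs 0,3, 1,3 and 2,3 would be reported as hangs and the remaining pairs as passes. A process stuck inside the driver may not exit even on SIGKILL, in which case `timeout` itself can block.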
Debugging attempts
- Tried amd_iommu=off on the kernel boot line (I am running bare metal).
- Disabled ACS on all PCIe devices using NCCL’s ACS troubleshooting script (setpci ECAP_ACS+0x6.w=0000).
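To confirm that both mitigations above actually took effect after reboot, something like the following could be used. It is a sketch, not an official procedure: `iommu_disabled` and `acs_enabled_count` are hypothetical helper names, and the ACS check follows the hint in the NCCL troubleshooting guide that ACSCtl lines showing SrcValid+ in `lspci -vvv` output suggest ACS may still be enabled.

```shell
#!/bin/sh
# Sketch: verify that amd_iommu=off is active and that ACS is cleared.

# Succeeds if the kernel command line on stdin contains amd_iommu=off.
iommu_disabled() {
  tr ' ' '\n' | grep -qx 'amd_iommu=off'
}

# Reads `lspci -vvv` output on stdin and prints how many ACSCtl lines
# still show SrcValid+ (0 means ACS looks disabled everywhere).
acs_enabled_count() {
  grep 'ACSCtl' | grep -c 'SrcValid+' || true
}

# Run the live checks only when invoked with --check (needs root for lspci -vvv).
if [ "${1:-}" = "--check" ]; then
  if iommu_disabled < /proc/cmdline; then
    echo "amd_iommu=off is on the kernel command line"
  else
    echo "amd_iommu=off is NOT on the kernel command line"
  fi
  echo "ACSCtl lines with SrcValid+: $(sudo lspci -vvv 2>/dev/null | acs_enabled_count)"
fi
```

Note that setpci changes do not persist across reboots, so the ACS count is worth re-checking right before each test run.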
Attached are debug logs from a run with:
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,GRAPH,COLL ./all_reduce_perf_amd -g 4
nccl_debug.txt (59.1 KB)
topo.txt (3.2 KB)
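For anyone trying to reproduce this, logs like the ones attached can be written straight to files rather than copied from the terminal. The sketch below assumes the documented NCCL environment variables NCCL_DEBUG_FILE (with %h for hostname and %p for PID) and NCCL_TOPO_DUMP_FILE; the output directory and file names are hypothetical, and this is not necessarily how the attached files were produced.

```shell
#!/bin/sh
# Sketch: run the test with NCCL debug output and the detected topology saved to files.
NCCL_TEST_BIN="${NCCL_TEST_BIN:-./all_reduce_perf_amd}"  # binary name taken from the command above
LOG_DIR="${LOG_DIR:-./nccl_logs}"                        # hypothetical output directory

# Print the NAME=value assignments for a debug run, one per line.
# $1 is the directory the log files should go to (must not contain spaces).
nccl_debug_env() {
  printf '%s\n' \
    "NCCL_DEBUG=INFO" \
    "NCCL_DEBUG_SUBSYS=INIT,GRAPH,COLL" \
    "NCCL_DEBUG_FILE=$1/nccl_debug.%h.%p.txt" \
    "NCCL_TOPO_DUMP_FILE=$1/topo.xml"
}

# Only run when the binary is actually present.
if [ -x "$NCCL_TEST_BIN" ]; then
  mkdir -p "$LOG_DIR"
  # Word splitting of the assignments is intentional here.
  env $(nccl_debug_env "$LOG_DIR") "$NCCL_TEST_BIN" -g 4
fi
```

Because the run is expected to hang on the affected configuration, the log file is still useful: the last lines written before the stall show how far initialization got.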
Observations
- The issue goes away if I downgrade the driver to 550.163.01 (with CUDA 12.4 + NCCL 2.20.5).
- On a different server with 8× A100 SXM4 but Intel CPUs, the same driver 570.172.08 + CUDA 12.8 + NCCL 2.20.5 stack works fine with no hangs.
Let me know if any information is missing.
Thanks