Hello,
When using nvbandwidth
to measure bandwidth between the host and the devices, we observed anomalously low bandwidth on two of the A100 GPUs in our system.
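For reference, a minimal sketch of how the two testcases reported below can be run in isolation, assuming the stock nvbandwidth CLI and its -t/--testcase option (running the binary with no arguments executes all testcases):
./nvbandwidth -t host_to_device_memcpy_ce
./nvbandwidth -t device_to_host_memcpy_ce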
- Specs:
CPU: 2 x AMD EPYC 7543
GPU: 8 x A100-SXM4-80GB (driver: 510.47.03)
OS: CentOS 7 (kernel: 3.10.0-1160.el7.x86_64)
Env: gcc/11.3.0, cuda/11.6
- Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx5_0 mlx5_1 mlx5_2 mlx5_3 mlx5_4 mlx5_5 mlx5_6 mlx5_7 mlx5_8 mlx5_9 CPU Affinity NUMA Affinity
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS 24-31 3
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS 24-31 3
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS 8-15 1
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS 8-15 1
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS 56-63 7
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS 56-63 7
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS 40-47 5
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS 40-47 5
mlx5_0 PXB PXB SYS SYS SYS SYS SYS SYS X PIX SYS SYS SYS SYS SYS SYS SYS SYS
mlx5_1 PXB PXB SYS SYS SYS SYS SYS SYS PIX X SYS SYS SYS SYS SYS SYS SYS SYS
mlx5_2 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS X PIX SYS SYS SYS SYS SYS SYS
mlx5_3 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS PIX X SYS SYS SYS SYS SYS SYS
mlx5_4 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS X PIX SYS SYS SYS SYS
mlx5_5 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS PIX X SYS SYS SYS SYS
mlx5_6 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS X PIX SYS SYS
mlx5_7 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS PIX X SYS SYS
mlx5_8 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS X PIX
mlx5_9 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX X
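(For completeness: a matrix in the format above can be regenerated on the node with the command below.)
nvidia-smi topo -m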
- Host-to-device and device-to-host bandwidth measured with
nvbandwidth
nvbandwidth Version: v0.2
Built from Git version: 6cefdda
Device 0: NVIDIA A100-SXM4-80GB
Device 1: NVIDIA A100-SXM4-80GB
Device 2: NVIDIA A100-SXM4-80GB
Device 3: NVIDIA A100-SXM4-80GB
Device 4: NVIDIA A100-SXM4-80GB
Device 5: NVIDIA A100-SXM4-80GB
Device 6: NVIDIA A100-SXM4-80GB
Device 7: NVIDIA A100-SXM4-80GB
Running host_to_device_memcpy_ce.
memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)
0 1 *2 *3 4 5 6 7
0 24.56 24.57 *12.21 *12.20 24.55 24.33 24.57 24.46
SUM host_to_device_memcpy_ce 171.45
Running device_to_host_memcpy_ce.
memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s)
0 1 *2 *3 4 5 6 7
0 26.33 24.16 *13.18 *13.18 24.16 23.96 24.45 25.34
SUM device_to_host_memcpy_ce 174.76
Bandwidth between the host and GPU 2/GPU 3 is roughly half of that between the host and the other GPUs.
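As an independent cross-check, one of the suspect GPUs can also be probed with the bandwidthTest sample from the CUDA samples (a sketch, assuming the sample is built locally; --memory=pinned keeps it comparable to the copy-engine test above):
./bandwidthTest --device=2 --memory=pinned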
Consistent with this, we also observed a discrepancy in HPL performance between two otherwise symmetric groups of GPUs, which could be attributed to the reduced host-device bandwidth:
CUDA_VISIBLE_DEVICES=0,1,2,3 -> Rmax = 44.8 TFLOPS
CUDA_VISIBLE_DEVICES=4,5,6,7 -> Rmax = 52.5 TFLOPS
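The ~12 GB/s figures are roughly half of the ~24.5 GB/s measured on the other GPUs, which is the kind of drop one would expect if a PCIe link trained at a reduced width or generation (e.g. x8 instead of x16, or Gen3 instead of Gen4). A quick way to inspect the negotiated link state per GPU, as a sketch using standard nvidia-smi query fields:
nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv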
We suspect that this could be a hardware issue.
Please kindly suggest ways to further debug this issue. Your insights are much appreciated.
Regards,