[nvbandwidth] Debugging Anomalous Host-to-Device Memory Bandwidth

Hello,

When using nvbandwidth to measure bandwidths between the host and devices, we observed anomalously low bandwidth for two of the A100 GPUs in our system.

  • Specs:
CPU: 2 x AMD EPYC 7543
GPU:  8 x A100-SXM4 (driver: 510.47.03)
OS:  CentOS 7 (kernel: 3.10.0-1160.el7.x86_64)
Env: gcc/11.3.0, cuda/11.6
  • Topology (nvidia-smi topo -m):
	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	mlx5_0	mlx5_1	mlx5_2	mlx5_3	mlx5_4	mlx5_5	mlx5_6	mlx5_7	mlx5_8	mlx5_9	CPU Affinity	NUMA Affinity
GPU0	 X 	NV12	NV12	NV12	NV12	NV12	NV12	NV12	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	24-31	3
GPU1	NV12	 X 	NV12	NV12	NV12	NV12	NV12	NV12	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	24-31	3
GPU2	NV12	NV12	 X 	NV12	NV12	NV12	NV12	NV12	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	8-15	1
GPU3	NV12	NV12	NV12	 X 	NV12	NV12	NV12	NV12	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	8-15	1
GPU4	NV12	NV12	NV12	NV12	 X 	NV12	NV12	NV12	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	56-63	7
GPU5	NV12	NV12	NV12	NV12	NV12	 X 	NV12	NV12	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	56-63	7
GPU6	NV12	NV12	NV12	NV12	NV12	NV12	 X 	NV12	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	40-47	5
GPU7	NV12	NV12	NV12	NV12	NV12	NV12	NV12	 X 	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	40-47	5
mlx5_0	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PIX	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS		
mlx5_1	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	PIX	 X 	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS		
mlx5_2	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PIX	SYS	SYS	SYS	SYS	SYS	SYS		
mlx5_3	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	PIX	 X 	SYS	SYS	SYS	SYS	SYS	SYS		
mlx5_4	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PIX	SYS	SYS	SYS	SYS		
mlx5_5	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	PIX	 X 	SYS	SYS	SYS	SYS		
mlx5_6	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PIX	SYS	SYS		
mlx5_7	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	PIX	 X 	SYS	SYS		
mlx5_8	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PIX		
mlx5_9	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PIX	 X 		
  • Host-to-device bandwidth measured with nvbandwidth (an invocation sketch follows the output)
nvbandwidth Version: v0.2
Built from Git version: 6cefdda

Device 0: NVIDIA A100-SXM4-80GB
Device 1: NVIDIA A100-SXM4-80GB
Device 2: NVIDIA A100-SXM4-80GB
Device 3: NVIDIA A100-SXM4-80GB
Device 4: NVIDIA A100-SXM4-80GB
Device 5: NVIDIA A100-SXM4-80GB
Device 6: NVIDIA A100-SXM4-80GB
Device 7: NVIDIA A100-SXM4-80GB

Running host_to_device_memcpy_ce.
memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)
          0         1         *2         *3         4         5         6         7
0     24.56     24.57     *12.21     *12.20     24.55     24.33     24.57     24.46

SUM host_to_device_memcpy_ce 171.45

Running device_to_host_memcpy_ce.
memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s)
          0         1         *2         *3         4         5         6         7
0     26.33     24.16     *13.18     *13.18     24.16     23.96     24.45     25.34

SUM device_to_host_memcpy_ce 174.76
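
For reference, a minimal invocation sketch for reproducing just these two testcases (assuming nvbandwidth's -t/--testcase option accepts a list of testcase names; check ./nvbandwidth --help for the exact syntax of your build):

$ ./nvbandwidth -t host_to_device_memcpy_ce device_to_host_memcpy_ce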

The bandwidth between the host and GPU 2/GPU 3 is about half of that measured for the other GPUs.
Consistent with this, we also observed a discrepancy in HPL performance between two otherwise symmetric groups of GPUs, which could be attributed to the reduced bandwidths:

CUDA_VISIBLE_DEVICES=0,1,2,3 -> Rmax = 44.8 TFLOPS
CUDA_VISIBLE_DEVICES=4,5,6,7 -> Rmax = 52.5 TFLOPS

We suspect this could be a hardware issue.

Please kindly suggest ways to further debug this issue. Your insights are much appreciated.

Regards,

If it were me, I would probably start by attempting to reconfirm the measurement using a directed test, such as bandwidthTest, specifying the correct CPU socket or NUMA node via e.g. taskset, and also specifying one of the GPUs in question either via a command-line switch (to bandwidthTest) or via CUDA_VISIBLE_DEVICES.
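
For example, something along these lines (a sketch only; the core range comes from the CPU Affinity column in the topology above, and bandwidthTest's --memory/--device switches can be checked with --help):

# expose only GPU 2 and pin the process to the cores local to it (8-15 per the topology)
$ CUDA_VISIBLE_DEVICES=2 taskset -c 8-15 ./bandwidthTest --memory=pinned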

If that reproduced the observation, I would probably check PCIE settings for that device via nvidia-smi, comparing them to output from other devices on that machine.
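
For instance (a sketch; nvidia-smi -q -d PCI dumps the per-GPU PCIe section, which can then be diffed between a suspect GPU and a healthy one):

$ nvidia-smi -q -d PCI -i 2 > gpu2_pci.txt
$ nvidia-smi -q -d PCI -i 0 > gpu0_pci.txt
$ diff gpu0_pci.txt gpu2_pci.txt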

Hi @Robert_Crovella

Per your suggestion, I’ve redone the measurement using the bandwidthTest sample provided with the CUDA samples.

  • CUDA_VISIBLE_DEVICES=0 (mapped to NUMA node 3 according to the topology)
$ CUDA_VISIBLE_DEVICES=0 numactl --cpunodebind=3 --membind=3 ./bandwidthTest 
Running on...

 Device 0: NVIDIA A100-SXM4-80GB
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			24.5

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			26.3

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			1368.2

Result = PASS
  • CUDA_VISIBLE_DEVICES=2 (mapped to NUMA node 1 according to the topology)
$ CUDA_VISIBLE_DEVICES=2 numactl --cpunodebind=1 --membind=1 ./bandwidthTest
Running on...

 Device 0: NVIDIA A100-SXM4-80GB
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			12.2

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			13.2

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			1370.8

Result = PASS

Thus the bandwidths measured here are consistent with the nvbandwidth results.

  • PCIe queried with nvidia-smi
 $ nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current --format=csv
index, pcie.link.gen.current, pcie.link.gen.max, pcie.link.width.current
0, 4, 4, 16
1, 4, 4, 16
2, 4, 4, 16
3, 4, 4, 16
4, 4, 4, 16
5, 4, 4, 16
6, 4, 4, 16
7, 4, 4, 16

So I think all the PCIe links are working as expected.
If my understanding is insufficient, please elaborate on which PCIe settings to check.

Thanks for the suggestions.

I’m not sure I will be able to offer any more info that is useful.

I guess I would want to make sure that nothing else is running on the system. Perhaps reboot the system and try things again.

The PCIE data suggests that the GPU doesn’t have any obvious PCIE issues. However, the path from CPU to GPU also passes through some PCIE switches, and I think it might be the case that both of these GPUs are connected to the same PCIE switch. So you might want to review the PCIE topology of your system. Coupled with that, since a PCIE switch is involved, I would want to make sure that the system has the latest proper firmware installed.

You can use the lspci command to inspect information about the configuration of other PCIE devices such as the interposing switches and their links, but I don’t have a recipe for you. The command itself is part of a typical Linux install, and you can find command-line help for it as well as write-ups on the internet.
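
As a rough sketch of what that inspection could look like (the bus address below is a placeholder; the actual addresses of the bridges/switches upstream of GPU 2 and GPU 3 have to be read off the tree view first):

$ lspci -tv                                                    # tree view: locate the switches upstream of the affected GPUs
$ sudo lspci -vvv -s <bus:dev.fn> | grep -iE 'LnkCap|LnkSta'   # compare advertised vs. negotiated link speed/width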

So there are two CPUs in the system, each comprising four core complexes (CCX), correct? I wonder how the PCIe slots are mapped to these two layers of NUMA-ness.

I do not have experience with such systems, but if that were my system, I would review the numactl settings to make sure they are entirely regular, in that there is exactly one core in each of the eight CCXs talking to the nearest of the eight GPUs, and that this is the same core in each of the eight CCXs (e.g. core 0).

The beauty of a chiplet design…

Try these bindings (local rank from 0 to 7; a wrapper sketch follows the list):
numactl --physcpubind=32-47 --membind=2 $APP
numactl --physcpubind=48-63 --membind=3 $APP
numactl --physcpubind=0-15 --membind=0 $APP
numactl --physcpubind=16-31 --membind=1 $APP
numactl --physcpubind=96-111 --membind=6 $APP
numactl --physcpubind=112-127 --membind=7 $APP
numactl --physcpubind=64-79 --membind=4 $APP
numactl --physcpubind=80-95 --membind=5 $APP
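
If it helps, a minimal wrapper sketch for applying these per local rank (assuming an MPI launcher that exports OMPI_COMM_WORLD_LOCAL_RANK or SLURM_LOCALID; the variable name depends on your launcher):

#!/bin/bash
# bind.sh -- hypothetical wrapper: pick the numactl binding for this local rank, then run the app
rank=${OMPI_COMM_WORLD_LOCAL_RANK:-${SLURM_LOCALID:-0}}
cpus=(32-47 48-63 0-15 16-31 96-111 112-127 64-79 80-95)
mems=(2 3 0 1 6 7 4 5)
exec numactl --physcpubind=${cpus[$rank]} --membind=${mems[$rank]} "$@"

Used e.g. as: mpirun -np 8 ./bind.sh $APP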


@Robert_Crovella

When inspecting and comparing the PCIe topology with lspci, we did not notice any differences between this node and another identical node, so it is unlikely that the PCIe switches are misconfigured.
Nevertheless, there is not much more we can do with software testing.
Hardware engineers will inspect the server. I will update the thread if we pinpoint the issue.
Thanks very much for your time.

@njuffa:

Thanks for the suggestion.
The BIOS offers various NUMA-per-socket (NPS) options.
In this case we use NPS=4 and bind the process to the NUMA domain closest to the GPU.
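
For completeness, the resulting NUMA layout can be confirmed with standard tools, e.g. (a sketch):

$ numactl --hardware     # with NPS=4 on this 2-socket system, this should list 8 nodes
$ lscpu | grep -i numa   # cross-check node count and per-node CPU lists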

@mfatica
Thanks for the suggestion.

The server was configured with NPS=4, i.e. 8 NUMA nodes in total.
The bandwidthTest run that Robert requested was already conducted with the correct NUMA mapping:

$ CUDA_VISIBLE_DEVICES=2 numactl --cpunodebind=1 --membind=1 ./bandwidthTest

I still think it would be a good test to use exactly the bindings suggested by mfatica.
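
For example, something like the following (a sketch; which physcpubind/membind pair corresponds to GPU 2 in mfatica's ordering is an assumption here and should be verified against nvidia-smi topo -m, which reports NUMA node 1 and cores 8-15 for GPU 2):

# hypothetical pairing: GPU 2 taken as local rank 2 in the list above
$ CUDA_VISIBLE_DEVICES=2 numactl --physcpubind=0-15 --membind=0 ./bandwidthTest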