I am testing NCCL performance on a server with two A5000 GPUs. Each GPU is connected directly to the CPU over PCIe 4.0 x16, without NVLink or a PCIe switch. I expected the throughput to reach about 20 GB/s, but it only gets to about 12 GB/s.
I ran p2pBandwidthLatencyTest --sm_copy from the cuda-samples. The results show that the P2P-enabled bandwidth (12.4 GB/s) is much lower than the P2P-disabled bandwidth (21.6 GB/s), which seems odd to me.
I checked the topology and the PCIe ACS configuration, and both look OK. Does anyone have any idea? Thanks!
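For reference, this is the kind of direct GPU0->GPU1 transfer I expect to run at around 20 GB/s. A minimal sketch of my own (not nccl-tests or the cuda-samples code), assuming the two A5000s enumerate as devices 0 and 1; error checking is omitted, and the 256 MiB buffer and 20 iterations are arbitrary choices:

// Copy-engine P2P measurement: enable peer access both ways, then time
// GPU0 -> GPU1 copies with cudaMemcpyPeerAsync.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  const size_t bytes = 256ull << 20;   // 256 MiB per transfer (arbitrary)
  const int iters = 20;
  void *src = nullptr, *dst = nullptr;

  cudaSetDevice(0);
  cudaDeviceEnablePeerAccess(1, 0);    // let GPU0 address GPU1 memory directly
  cudaMalloc(&src, bytes);

  cudaSetDevice(1);
  cudaDeviceEnablePeerAccess(0, 0);
  cudaMalloc(&dst, bytes);

  cudaSetDevice(0);
  cudaEvent_t t0, t1;
  cudaEventCreate(&t0);
  cudaEventCreate(&t1);

  cudaEventRecord(t0);
  for (int i = 0; i < iters; ++i)
    cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, 0);   // GPU0 -> GPU1, default stream
  cudaEventRecord(t1);
  cudaEventSynchronize(t1);

  float ms = 0.f;
  cudaEventElapsedTime(&ms, t0, t1);
  printf("P2P GPU0 -> GPU1: %.2f GB/s\n",
         (double)bytes * iters / (ms / 1e3) / 1e9);
  return 0;
}

The p2pBandwidthLatencyTest output: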
# ./p2pBandwidthLatencyTest --sm_copy
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX A5000, pciBusID: 17, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA RTX A5000, pciBusID: e3, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1
0 1 1
1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 675.53 21.60
1 20.67 676.11
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1
0 674.65 12.43
1 12.43 677.58
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 680.98 29.89
1 30.06 680.09
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1
0 680.83 24.86
1 24.86 680.53
P2P=Disabled Latency Matrix (us)
GPU 0 1
0 1.58 14.69
1 10.70 1.57
CPU 0 1
0 2.76 7.46
1 7.08 2.96
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1
0 1.58 2.69
1 2.78 1.57
CPU 0 1
0 2.80 2.07
1 2.28 3.14
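As I understand it, the --sm_copy flag makes the test drive the transfer with a copy kernel instead of the copy engines. A rough sketch of that kind of SM-driven path (my own illustration, again assuming devices 0 and 1, no error checking): a kernel launched on GPU 0 writes through a pointer that lives in GPU 1's memory, so every store has to cross PCIe.

// SM-driven peer copy: the kernel runs on GPU0 and writes to a buffer on GPU1.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void peerCopy(const float *src, float *dst, size_t n) {
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  size_t stride = (size_t)gridDim.x * blockDim.x;
  for (; i < n; i += stride)
    dst[i] = src[i];              // dst is remote: the store goes over PCIe
}

int main() {
  const size_t n = 64ull << 20;   // 64M floats = 256 MiB (arbitrary)
  float *src = nullptr, *dst = nullptr;

  cudaSetDevice(0);
  cudaDeviceEnablePeerAccess(1, 0);        // map GPU1 memory into GPU0's address space
  cudaMalloc((void **)&src, n * sizeof(float));

  cudaSetDevice(1);
  cudaMalloc((void **)&dst, n * sizeof(float));   // remote buffer the kernel writes

  cudaSetDevice(0);
  cudaEvent_t t0, t1;
  cudaEventCreate(&t0);
  cudaEventCreate(&t1);

  cudaEventRecord(t0);
  peerCopy<<<512, 256>>>(src, dst, n);     // launched on GPU0, writes land on GPU1
  cudaEventRecord(t1);
  cudaEventSynchronize(t1);

  float ms = 0.f;
  cudaEventElapsedTime(&ms, t0, t1);
  printf("SM peer copy GPU0 -> GPU1: %.2f GB/s\n",
         n * sizeof(float) / (ms / 1e3) / 1e9);
  return 0;
}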
nvidia-smi topo -m shows:
GPU0 GPU1 NIC0 NIC1 CPU Affinity NUMA Affinity
GPU0 X SYS NODE SYS 0,2,4,6,8,10 0
GPU1 SYS X SYS NODE 1,3,5,7,9,11 1
NIC0 NODE SYS X SYS
NIC1 SYS NODE SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
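So the two GPUs sit on different NUMA nodes, and the GPU-to-GPU path is classified as SYS (it crosses the inter-socket interconnect). As a cross-check of what the CUDA runtime itself thinks of the pair, the P2P attributes can be queried; a small sketch that only reports attributes and does not measure anything:

// Query the runtime's view of the GPU0 -> GPU1 pair: access support,
// relative performance rank of the link, and native atomic support.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int access = 0, rank = 0, atomics = 0;
  cudaDeviceGetP2PAttribute(&access,  cudaDevP2PAttrAccessSupported,       0, 1);
  cudaDeviceGetP2PAttribute(&rank,    cudaDevP2PAttrPerformanceRank,       0, 1);
  cudaDeviceGetP2PAttribute(&atomics, cudaDevP2PAttrNativeAtomicSupported, 0, 1);
  printf("GPU0 -> GPU1: access=%d perfRank=%d nativeAtomics=%d\n",
         access, rank, atomics);
  return 0;
}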
And lspci -vvv | grep ACSCtl gives:
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
I also ran bandwidthTest from the cuda-samples. It seems that both D2H and H2D can reach about 25 GB/s.
# ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: NVIDIA RTX A5000
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 24.3
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 26.3
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 648.8
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
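In case anyone wants to reproduce that host-link number without the samples, a minimal sketch of the same kind of pinned host-to-device measurement (my own code, not bandwidthTest; 256 MiB and 20 iterations are again arbitrary, error checking omitted):

// Pinned-memory H2D bandwidth: cudaHostAlloc gives page-locked host memory,
// so cudaMemcpyAsync can DMA at full PCIe speed.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  const size_t bytes = 256ull << 20;   // 256 MiB (arbitrary)
  const int iters = 20;
  void *host = nullptr, *dev = nullptr;

  cudaSetDevice(0);
  cudaHostAlloc(&host, bytes, cudaHostAllocDefault);   // pinned host buffer
  cudaMalloc(&dev, bytes);

  cudaEvent_t t0, t1;
  cudaEventCreate(&t0);
  cudaEventCreate(&t1);

  cudaEventRecord(t0);
  for (int i = 0; i < iters; ++i)
    cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, 0);
  cudaEventRecord(t1);
  cudaEventSynchronize(t1);

  float ms = 0.f;
  cudaEventElapsedTime(&ms, t0, t1);
  printf("Pinned H2D: %.2f GB/s\n",
         (double)bytes * iters / (ms / 1e3) / 1e9);

  cudaFreeHost(host);
  cudaFree(dev);
  return 0;
}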
Hi @chengjunjia1997, did you find out what was causing the drop in throughput with P2P enabled?