How can I improve the 'P2P enabled' bandwidth when testing NCCL performance with two A5000 GPUs over PCIe 4.0 x16?

I am testing NCCL performance on my server with two A5000 GPUs. They are connected directly to the CPU over PCIe 4.0 x16, with no NVLink or PCIe switch in between. I expected the throughput to reach about 20 GB/s, but it only reaches 12 GB/s.

I ran p2pBandwidthLatencyTest --sm_copy from the cuda-samples. The results show that the 'P2P enabled' bandwidth (12.4 GB/s) is much lower than the 'P2P disabled' bandwidth (21.6 GB/s), which seems strange to me.

I checked the topology and the PCIe ACS configuration, and both look fine. Does anyone have any idea what is going on? Thanks!

# ./p2pBandwidthLatencyTest --sm_copy
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX A5000, pciBusID: 17, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA RTX A5000, pciBusID: e3, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0       1     1
     1       1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 675.53  21.60
     1  20.67 676.11
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1
     0 674.65  12.43
     1  12.43 677.58
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 680.98  29.89
     1  30.06 680.09
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 680.83  24.86
     1  24.86 680.53
P2P=Disabled Latency Matrix (us)
   GPU     0      1
     0   1.58  14.69
     1  10.70   1.57

   CPU     0      1
     0   2.76   7.46
     1   7.08   2.96
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1
     0   1.58   2.69
     1   2.78   1.57

   CPU     0      1
     0   2.80   2.07
     1   2.28   3.14
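
For anyone who wants to reproduce the 'P2P enabled' number outside the samples, here is a minimal standalone sketch. This is my own illustration, not the samples' code: it times copy-engine cudaMemcpyPeerAsync transfers rather than the SM-copy kernel that --sm_copy selects, and the buffer size and iteration count are arbitrary choices.

// Minimal sketch: time GPU0 -> GPU1 copies with peer access enabled.
// Without the cudaDeviceEnablePeerAccess calls, the same cudaMemcpyPeerAsync
// is staged through host memory instead (the "P2P disabled" path).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256ull << 20;  // 256 MiB per transfer (arbitrary)
    const int iters = 20;               // arbitrary

    int can01 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    printf("GPU0 can access GPU1: %d\n", can01);

    void *src, *dst;
    cudaSetDevice(0); cudaMalloc(&src, bytes);
    cudaSetDevice(1); cudaMalloc(&dst, bytes);

    // Enable direct peer access in both directions ("P2P enabled" case).
    cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0);
    cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);

    cudaStream_t s;
    cudaStreamCreate(&s);               // stream on GPU0
    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);

    cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, s);  // warm-up
    cudaEventRecord(start, s);
    for (int i = 0; i < iters; ++i)
        cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, s);
    cudaEventRecord(stop, s);
    cudaStreamSynchronize(s);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU0 -> GPU1: %.2f GB/s\n", (double)bytes * iters / (ms * 1e-3) / 1e9);
    return 0;
}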

nvidia-smi topo -m shows:

        GPU0    GPU1    NIC0    NIC1    CPU Affinity    NUMA Affinity
GPU0     X      SYS     NODE    SYS     0,2,4,6,8,10    0
GPU1    SYS      X      SYS     NODE    1,3,5,7,9,11    1
NIC0    NODE    SYS      X      SYS
NIC1    SYS     NODE    SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
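
Alongside nvidia-smi topo, CUDA's own view of the link can be queried. A small sketch of how to do that (my addition, not output from the samples):

// Query what the CUDA runtime reports about the GPU0 <-> GPU1 P2P link.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int val = 0;
    cudaDeviceGetP2PAttribute(&val, cudaDevP2PAttrAccessSupported, 0, 1);
    printf("P2P access supported: %d\n", val);
    cudaDeviceGetP2PAttribute(&val, cudaDevP2PAttrPerformanceRank, 0, 1);
    printf("performance rank (lower is better): %d\n", val);
    cudaDeviceGetP2PAttribute(&val, cudaDevP2PAttrNativeAtomicSupported, 0, 1);
    printf("native atomics supported: %d\n", val);
    return 0;
}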

And lspci -vvv | grep ACSCtl gives:

ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-

I also ran bandwidthTest from the cuda-samples. Both H2D and D2H reach about 25 GB/s.

# ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: NVIDIA RTX A5000
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     24.3

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     26.3

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     648.8

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
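
The PINNED rows above use page-locked host buffers. For reference, a minimal sketch of that kind of measurement (again my illustration, not the bandwidthTest code; the transfer size matches the 32000000-byte size reported above, the iteration count is arbitrary):

// Time host-to-device copies from a pinned (page-locked) buffer,
// which is what the PINNED H2D row reflects.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 32000000;
    const int iters = 20;  // arbitrary

    void *host, *dev;
    cudaMallocHost(&host, bytes);  // pinned memory allows full-speed DMA over PCIe
    cudaMalloc(&dev, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);

    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);  // warm-up
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D pinned: %.2f GB/s\n", (double)bytes * iters / (ms * 1e-3) / 1e9);
    return 0;
}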

Hi @chengjunjia1997, did you ever find out the reason for the throughput drop with P2P enabled?