Hello,
I have an issue regarding the bandwidth between my two GPUs (RTX A4500).
They are connected via PCIe 4.0 x16, and my motherboard is an MBD-X12DPG-OA6.
Here is the output of the CUDA sample p2pBandwidthLatencyTest:
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX A4500, pciBusID: 4f, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA RTX A4500, pciBusID: 52, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
     D\D     0     1
       0     1     1
       1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
     D\D       0       1
       0  560.44   17.37
       1   17.94  562.05
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
     D\D       0       1
       0  562.25   46.23
       1   39.22  561.44
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
     D\D       0       1
       0  566.12   19.75
       1   19.22  566.74
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
     D\D       0       1
       0  566.43   92.45
       1   92.37  566.84
P2P=Disabled Latency Matrix (us)
     GPU       0       1
       0    2.29   20.54
       1   11.64    2.31
     CPU       0       1
       0    2.66    6.89
       1    6.85    2.66
P2P=Enabled Latency (P2P Writes) Matrix (us)
     GPU       0       1
       0    2.29    1.31
       1    1.37    2.30
     CPU       0       1
       0    2.75    2.01
       1    2.07    2.69
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
NB: I get very similar values with other measurement methods, such as nvbandwidth and hand-written code (a minimal sketch of the kind of timing loop I mean follows these notes).
NB2: If I use the --sm_copy option, I reach ~250 GB/s for unidirectional P2P Device 1 ↔ Device 0.
The maximum throughput of PCIe 4.0 x16 is ~32 GB/s per direction (16 GT/s × 16 lanes with 128b/130b encoding ≈ 31.5 GB/s), yet I measure ~45 GB/s for unidirectional P2P Device 1 ↔ Device 0, which should be impossible.
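For reference, here is a minimal sketch of the kind of hand-written timing code I mean (not my exact code; the buffer size, iteration count, and measured direction are arbitrary choices):

// Sketch of a unidirectional D1 -> D0 peer-copy benchmark
// (illustrative only; 256 MiB buffers and 20 iterations are arbitrary).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256ull << 20;  // 256 MiB per copy
    const int iters = 20;

    // One buffer on each GPU.
    void *dst = nullptr, *src = nullptr;
    cudaSetDevice(0);
    cudaMalloc(&dst, bytes);
    cudaSetDevice(1);
    cudaMalloc(&src, bytes);

    // Enable direct peer access in both directions.
    cudaDeviceEnablePeerAccess(0, 0);  // device 1 may access device 0
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);  // device 0 may access device 1

    // Time the copies with CUDA events on device 1's default stream.
    cudaSetDevice(1);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpyPeerAsync(dst, 0, src, 1, bytes);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("D1 -> D0: %.2f GB/s\n",
           (double)bytes * iters / (ms * 1e-3) / 1e9);
    return 0;
}

I time with CUDA events rather than wall-clock time so that only the device-side copies are measured.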
Here is a copy of my nvidia-smi:
Thu Apr  6 17:19:02 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4500    On   | 00000000:4F:00.0 Off |                  Off |
| 30%   28C    P8    20W / 200W |      0MiB / 20470MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A4500    On   | 00000000:52:00.0 Off |                  Off |
| 30%   27C    P8     7W / 200W |      0MiB / 20470MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
And a copy of my nvidia-smi topo -m:
        GPU0    GPU1    mlx5_0  mlx5_1  CPU Affinity    NUMA Affinity
GPU0     X      PXB     PXB     PXB     0-11,24-35      0
GPU1    PXB      X      PXB     PXB     0-11,24-35      0
mlx5_0  PXB     PXB      X      PIX
mlx5_1  PXB     PXB     PIX      X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
Do you know what may cause this? I suspect CUDA somehow optimizes the transfer so that it does not really perform a direct Device 0 → Device 1 copy over the link, but I don't truly understand how.
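Is there a way to verify this? The sanity check I have in mind (again just a sketch, with the same arbitrary sizes as above) is to time the same copy with peer access deliberately not enabled, so the runtime has to stage through host memory, and see whether the >32 GB/s figure survives:

// Sketch: time the same D1 -> D0 copy WITHOUT ever enabling peer access,
// forcing cudaMemcpyPeerAsync to fall back to a copy staged through the host.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256ull << 20;  // same arbitrary 256 MiB buffers
    const int iters = 20;

    // Confirm the driver reports direct access support between the pair.
    int p2p = 0;
    cudaDeviceGetP2PAttribute(&p2p, cudaDevP2PAttrAccessSupported, 1, 0);
    printf("P2P access supported D1 -> D0: %d\n", p2p);

    void *dst = nullptr, *src = nullptr;
    cudaSetDevice(0);
    cudaMalloc(&dst, bytes);
    cudaSetDevice(1);
    cudaMalloc(&src, bytes);

    // Peer access deliberately NOT enabled here.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpyPeerAsync(dst, 0, src, 1, bytes);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("D1 -> D0 (no peer access): %.2f GB/s\n",
           (double)bytes * iters / (ms * 1e-3) / 1e9);
    return 0;
}

If the fallback path still reports more than one PCIe 4.0 x16 link can carry, I would suspect the measurement itself rather than the transfer path.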
Thank you in advance!