Hello,
I am trying to configure an NVLink connection between two NVIDIA RTX A4500 cards. However, I am not achieving the expected performance, as shown by the CUDA samples:
$ ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA RTX A4500 (GPU0) -> NVIDIA RTX A4500 (GPU1) : Yes
> Peer access from NVIDIA RTX A4500 (GPU1) -> NVIDIA RTX A4500 (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 0.01GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Shutting down...
Test passed
$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX A4500, pciBusID: 4f, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA RTX A4500, pciBusID: 52, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1
0 1 1
1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 562.86 17.24
1 17.72 564.28
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1
0 541.97 0.01
1 0.01 564.70
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 337.87 19.55
1 18.98 567.77
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1
0 552.41 0.02
1 0.02 567.67
P2P=Disabled Latency Matrix (us)
GPU 0 1
0 1.58 38.55
1 11.47 1.51
CPU 0 1
0 2.42 6.16
1 6.12 2.35
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1
0 1.59 155.44
1 148.67 1.51
CPU 0 1
0 2.36 1.85
1 1.75 2.35
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Peer-to-peer access is reported as supported, but the P2P transfers are extremely slow: around 0.01-0.02 GB/s with P2P enabled, versus roughly 17-19 GB/s over plain PCIe with P2P disabled.
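If it helps to test outside the samples, I can also run a minimal standalone timing of cudaMemcpyPeer along the lines of the sketch below (the file name, the 64 MiB buffer size and the GPU ordinals 0/1 are just assumptions on my side):

// p2p_check.cu - minimal peer-to-peer bandwidth check (sketch, not one of the CUDA samples)
// Build: nvcc p2p_check.cu -o p2p_check
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int src = 0, dst = 1;            // GPU ordinals as reported by nvidia-smi
    const size_t bytes = 64ull << 20;      // 64 MiB, same size simpleP2P uses

    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, dst, src);
    printf("GPU%d can access GPU%d: %d\n", dst, src, canAccess);

    // Allocate one buffer per GPU and enable peer access in both directions
    cudaSetDevice(src);
    void *bufSrc = nullptr; cudaMalloc(&bufSrc, bytes);
    cudaDeviceEnablePeerAccess(dst, 0);

    cudaSetDevice(dst);
    void *bufDst = nullptr; cudaMalloc(&bufDst, bytes);
    cudaDeviceEnablePeerAccess(src, 0);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaMemcpyPeer(bufDst, dst, bufSrc, src, bytes);   // warm-up copy

    const int reps = 20;
    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        cudaMemcpyPeer(bufDst, dst, bufSrc, src, bytes);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("cudaMemcpyPeer GPU%d -> GPU%d: %.2f GB/s\n",
           src, dst, (double)bytes * reps / (ms * 1e-3) / 1e9);

    cudaFree(bufDst);
    cudaSetDevice(src);
    cudaFree(bufSrc);
    return 0;
}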
Here is the output of nvidia-smi:
Fri Mar 24 15:47:26 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01 Driver Version: 515.86.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A4500 On | 00000000:4F:00.0 On | Off |
| 30% 31C P8 26W / 200W | 128MiB / 20470MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A4500 On | 00000000:52:00.0 Off | Off |
| 30% 32C P8 27W / 200W | 5MiB / 20470MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1615 G /usr/lib/xorg/Xorg 81MiB |
| 0 N/A N/A 1987 G /usr/bin/gnome-shell 45MiB |
| 1 N/A N/A 1615 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
Here is the output of nvidia-smi topo -m:
GPU0 GPU1 mlx5_0 mlx5_1 CPU Affinity NUMA Affinity
GPU0 X NV4 PXB PXB 0-11,24-35 0
GPU1 NV4 X PXB PXB 0-11,24-35 0
mlx5_0 PXB PXB X PIX
mlx5_1 PXB PXB PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
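Since the topology matrix reports NV4 between the two GPUs, I can also query the runtime's view of the link with cudaDeviceGetP2PAttribute if that is of any use. Something like this sketch (device ordinals 0 and 1 and the file name are my assumptions):

// p2p_attr.cu - query P2P attributes between the two GPUs (sketch)
// Build: nvcc p2p_attr.cu -o p2p_attr
#include <cstdio>
#include <cuda_runtime.h>

static void query(int src, int dst) {
    int access = 0, rank = 0, atomics = 0;
    cudaDeviceGetP2PAttribute(&access,  cudaDevP2PAttrAccessSupported,       src, dst);
    cudaDeviceGetP2PAttribute(&rank,    cudaDevP2PAttrPerformanceRank,       src, dst);
    cudaDeviceGetP2PAttribute(&atomics, cudaDevP2PAttrNativeAtomicSupported, src, dst);
    printf("GPU%d -> GPU%d: accessSupported=%d performanceRank=%d nativeAtomics=%d\n",
           src, dst, access, rank, atomics);
}

int main() {
    query(0, 1);
    query(1, 0);
    return 0;
}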
Here is the output of nvidia-smi nvlink --status:
GPU 0: NVIDIA RTX A4500 (UUID: GPU-794fc296-8027-c900-183f-29e9774fb44a)
Link 0: <inactive>
Link 1: <inactive>
Link 2: <inactive>
Link 3: <inactive>
GPU 1: NVIDIA RTX A4500 (UUID: GPU-55727cbb-2894-ced1-c32f-750d8b95c1e2)
Link 0: <inactive>
Link 1: <inactive>
Link 2: <inactive>
Link 3: <inactive>
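If a programmatic cross-check of the link state would help, I can also query NVML directly, along the lines of this sketch (the loop over 4 links simply matches the output above; the file name is mine):

// nvlink_state.cpp - query NVLink link state via NVML (sketch)
// Build: g++ nvlink_state.cpp -o nvlink_state -lnvidia-ml
#include <cstdio>
#include <nvml.h>

int main() {
    if (nvmlInit() != NVML_SUCCESS) {
        printf("nvmlInit failed\n");
        return 1;
    }

    for (unsigned int gpu = 0; gpu < 2; ++gpu) {
        nvmlDevice_t dev;
        if (nvmlDeviceGetHandleByIndex(gpu, &dev) != NVML_SUCCESS)
            continue;
        printf("GPU %u:\n", gpu);

        // The A4500s above report 4 links each
        for (unsigned int link = 0; link < 4; ++link) {
            nvmlEnableState_t active;
            nvmlReturn_t rc = nvmlDeviceGetNvLinkState(dev, link, &active);
            if (rc != NVML_SUCCESS) {
                printf("  Link %u: %s\n", link, nvmlErrorString(rc));
                continue;
            }
            printf("  Link %u: %s\n", link,
                   active == NVML_FEATURE_ENABLED ? "active" : "inactive");
        }
    }

    nvmlShutdown();
    return 0;
}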
I am on Ubuntu 20.04; here is the motherboard:
MBD-X12DPG-OA6
Among the things I discovered during my investigation is the (in my opinion) strange output of nvidia-smi nvlink -c / -p: both commands list the GPUs but report no link capabilities or remote PCIe information at all:
nvidia-smi nvlink -c
GPU 0: NVIDIA RTX A4500 (UUID: GPU-794fc296-8027-c900-183f-29e9774fb44a)
GPU 1: NVIDIA RTX A4500 (UUID: GPU-55727cbb-2894-ced1-c32f-750d8b95c1e2)
nvidia-smi nvlink -p
GPU 0: NVIDIA RTX A4500 (UUID: GPU-794fc296-8027-c900-183f-29e9774fb44a)
GPU 1: NVIDIA RTX A4500 (UUID: GPU-55727cbb-2894-ced1-c32f-750d8b95c1e2)
I already tried to adapt the solution described here (Multi-GPU Peer to Peer access failing on Tesla K80 - #15 by Robert_Crovella), i.e. disabling PCIe ACS, but without success. If you think this is the issue, I can retry with any other commands you provide.
Of course, feel free to ask for any additional information that could help.
Thank you in advance.