Hello,
I am trying to configure an NVLink connection between two NVIDIA RTX A4500 cards. However, I am not achieving the expected performance, as shown by the CUDA samples:
$ ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA RTX A4500 (GPU0) -> NVIDIA RTX A4500 (GPU1) : Yes
> Peer access from NVIDIA RTX A4500 (GPU1) -> NVIDIA RTX A4500 (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 0.01GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Shutting down...
Test passed
$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX A4500, pciBusID: 4f, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA RTX A4500, pciBusID: 52, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1
0 1 1
1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 562.86 17.24
1 17.72 564.28
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1
0 541.97 0.01
1 0.01 564.70
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 337.87 19.55
1 18.98 567.77
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1
0 552.41 0.02
1 0.02 567.67
P2P=Disabled Latency Matrix (us)
GPU 0 1
0 1.58 38.55
1 11.47 1.51
CPU 0 1
0 2.42 6.16
1 6.12 2.35
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1
0 1.59 155.44
1 148.67 1.51
CPU 0 1
0 2.36 1.85
1 1.75 2.35
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Peer-to-peer access is reported as supported, but the P2P transfers are extremely slow: around 0.01-0.02 GB/s with P2P enabled, versus roughly 17-19 GB/s over plain PCIe with P2P disabled.
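If it helps to test outside the samples, I can also run a minimal standalone timing of cudaMemcpyPeer along the lines of the sketch below (the file name, the 64 MiB buffer size and the GPU ordinals 0/1 are just assumptions on my side):

// p2p_check.cu - minimal peer-to-peer bandwidth check (sketch, not one of the CUDA samples)
// Build: nvcc p2p_check.cu -o p2p_check
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int src = 0, dst = 1;            // GPU ordinals as reported by nvidia-smi
    const size_t bytes = 64ull << 20;      // 64 MiB, same size simpleP2P uses

    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, dst, src);
    printf("GPU%d can access GPU%d: %d\n", dst, src, canAccess);

    // Allocate one buffer per GPU and enable peer access in both directions
    cudaSetDevice(src);
    void *bufSrc = nullptr; cudaMalloc(&bufSrc, bytes);
    cudaDeviceEnablePeerAccess(dst, 0);

    cudaSetDevice(dst);
    void *bufDst = nullptr; cudaMalloc(&bufDst, bytes);
    cudaDeviceEnablePeerAccess(src, 0);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaMemcpyPeer(bufDst, dst, bufSrc, src, bytes);   // warm-up copy

    const int reps = 20;
    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        cudaMemcpyPeer(bufDst, dst, bufSrc, src, bytes);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("cudaMemcpyPeer GPU%d -> GPU%d: %.2f GB/s\n",
           src, dst, (double)bytes * reps / (ms * 1e-3) / 1e9);

    cudaFree(bufDst);
    cudaSetDevice(src);
    cudaFree(bufSrc);
    return 0;
}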
Here is the output of nvidia-smi:
Fri Mar 24 15:47:26 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01 Driver Version: 515.86.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A4500 On | 00000000:4F:00.0 On | Off |
| 30% 31C P8 26W / 200W | 128MiB / 20470MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A4500 On | 00000000:52:00.0 Off | Off |
| 30% 32C P8 27W / 200W | 5MiB / 20470MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1615 G /usr/lib/xorg/Xorg 81MiB |
| 0 N/A N/A 1987 G /usr/bin/gnome-shell 45MiB |
| 1 N/A N/A 1615 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
Here is the output of nvidia-smi topo -m:
GPU0 GPU1 mlx5_0 mlx5_1 CPU Affinity NUMA Affinity
GPU0 X NV4 PXB PXB 0-11,24-35 0
GPU1 NV4 X PXB PXB 0-11,24-35 0
mlx5_0 PXB PXB X PIX
mlx5_1 PXB PXB PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
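Since the topology matrix reports NV4 between the two GPUs, I can also query the runtime's view of the link with cudaDeviceGetP2PAttribute if that is of any use. Something like this sketch (device ordinals 0 and 1 and the file name are my assumptions):

// p2p_attr.cu - query P2P attributes between the two GPUs (sketch)
// Build: nvcc p2p_attr.cu -o p2p_attr
#include <cstdio>
#include <cuda_runtime.h>

static void query(int src, int dst) {
    int access = 0, rank = 0, atomics = 0;
    cudaDeviceGetP2PAttribute(&access,  cudaDevP2PAttrAccessSupported,       src, dst);
    cudaDeviceGetP2PAttribute(&rank,    cudaDevP2PAttrPerformanceRank,       src, dst);
    cudaDeviceGetP2PAttribute(&atomics, cudaDevP2PAttrNativeAtomicSupported, src, dst);
    printf("GPU%d -> GPU%d: accessSupported=%d performanceRank=%d nativeAtomics=%d\n",
           src, dst, access, rank, atomics);
}

int main() {
    query(0, 1);
    query(1, 0);
    return 0;
}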
Here is the output of nvidia-smi nvlink --status:
GPU 0: NVIDIA RTX A4500 (UUID: GPU-794fc296-8027-c900-183f-29e9774fb44a)
Link 0: <inactive>
Link 1: <inactive>
Link 2: <inactive>
Link 3: <inactive>
GPU 1: NVIDIA RTX A4500 (UUID: GPU-55727cbb-2894-ced1-c32f-750d8b95c1e2)
Link 0: <inactive>
Link 1: <inactive>
Link 2: <inactive>
Link 3: <inactive>
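If a programmatic cross-check of the link state would help, I can also query NVML directly, along the lines of this sketch (the loop over 4 links simply matches the output above; the file name is mine):

// nvlink_state.cpp - query NVLink link state via NVML (sketch)
// Build: g++ nvlink_state.cpp -o nvlink_state -lnvidia-ml
#include <cstdio>
#include <nvml.h>

int main() {
    if (nvmlInit() != NVML_SUCCESS) {
        printf("nvmlInit failed\n");
        return 1;
    }

    for (unsigned int gpu = 0; gpu < 2; ++gpu) {
        nvmlDevice_t dev;
        if (nvmlDeviceGetHandleByIndex(gpu, &dev) != NVML_SUCCESS)
            continue;
        printf("GPU %u:\n", gpu);

        // The A4500s above report 4 links each
        for (unsigned int link = 0; link < 4; ++link) {
            nvmlEnableState_t active;
            nvmlReturn_t rc = nvmlDeviceGetNvLinkState(dev, link, &active);
            if (rc != NVML_SUCCESS) {
                printf("  Link %u: %s\n", link, nvmlErrorString(rc));
                continue;
            }
            printf("  Link %u: %s\n", link,
                   active == NVML_FEATURE_ENABLED ? "active" : "inactive");
        }
    }

    nvmlShutdown();
    return 0;
}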
I am on Ubuntu 20.04; here is the motherboard:
MBD-X12DPG-OA6
Among the things I discovered during my investigation is the (in my opinion) strange output of nvidia-smi nvlink -c / -p: both commands list the GPUs but report no link capabilities or remote PCIe information at all:
nvidia-smi nvlink -c
GPU 0: NVIDIA RTX A4500 (UUID: GPU-794fc296-8027-c900-183f-29e9774fb44a)
GPU 1: NVIDIA RTX A4500 (UUID: GPU-55727cbb-2894-ced1-c32f-750d8b95c1e2)
nvidia-smi nvlink -p
GPU 0: NVIDIA RTX A4500 (UUID: GPU-794fc296-8027-c900-183f-29e9774fb44a)
GPU 1: NVIDIA RTX A4500 (UUID: GPU-55727cbb-2894-ced1-c32f-750d8b95c1e2)
I already tried to adapt the solution described here (Multi-GPU Peer to Peer access failing on Tesla K80 - #15 by Robert_Crovella), i.e. disabling PCIe ACS, but without success. If you think this is the issue, I can retry with any other commands you provide.
Of course, feel free to ask for any additional information that could help.
Thank you in advance.