Issue with P2P connection using two RTX A4500

Hello,

I am trying to configure an NVLink connection between two NVIDIA RTX A4500 cards. However, I am not achieving the expected performance, as shown by the CUDA samples:

[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2

Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA RTX A4500 (GPU0) -> NVIDIA RTX A4500 (GPU1) : Yes
> Peer access from NVIDIA RTX A4500 (GPU1) -> NVIDIA RTX A4500 (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 0.01GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Shutting down...
Test passed

$ ./p2pBandwidthLatencyTest

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX A4500, pciBusID: 4f, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA RTX A4500, pciBusID: 52, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0	     1     1
     1	     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 562.86  17.24 
     1  17.72 564.28 
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1 
     0 541.97   0.01 
     1   0.01 564.70 
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 337.87  19.55 
     1  18.98 567.77 
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 552.41   0.02 
     1   0.02 567.67 
P2P=Disabled Latency Matrix (us)
   GPU     0      1 
     0   1.58  38.55 
     1  11.47   1.51 

   CPU     0      1 
     0   2.42   6.16 
     1   6.12   2.35 
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1 
     0   1.59 155.44 
     1 148.67   1.51 

   CPU     0      1 
     0   2.36   1.85 
     1   1.75   2.35 

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

The peer-to-peer access seems to work, but it is very slow.

Here is the output of nvidia-smi:

Fri Mar 24 15:47:26 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4500    On   | 00000000:4F:00.0  On |                  Off |
| 30%   31C    P8    26W / 200W |    128MiB / 20470MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A4500    On   | 00000000:52:00.0 Off |                  Off |
| 30%   32C    P8    27W / 200W |      5MiB / 20470MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1615      G   /usr/lib/xorg/Xorg                 81MiB |
|    0   N/A  N/A      1987      G   /usr/bin/gnome-shell               45MiB |
|    1   N/A  N/A      1615      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

Here is the output of nvidia-smi topo -m:

            GPU0	  GPU1	mlx5_0	mlx5_1	CPU Affinity	NUMA Affinity
GPU0	 X 	NV4	PXB	PXB	0-11,24-35	0
GPU1	NV4	 X 	PXB	PXB	0-11,24-35	0
mlx5_0	PXB	PXB	 X 	PIX		
mlx5_1	PXB	PXB	PIX	 X 		

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Here is the output of nvidia-smi nvlink --status:

GPU 0: NVIDIA RTX A4500 (UUID: GPU-794fc296-8027-c900-183f-29e9774fb44a)
	 Link 0: <inactive>
	 Link 1: <inactive>
	 Link 2: <inactive>
	 Link 3: <inactive>
GPU 1: NVIDIA RTX A4500 (UUID: GPU-55727cbb-2894-ced1-c32f-750d8b95c1e2)
	 Link 0: <inactive>
	 Link 1: <inactive>
	 Link 2: <inactive>
	 Link 3: <inactive>

I am on Ubuntu 20.04; here is the motherboard model:

MBD-X12DPG-OA6

A few things I discovered during my investigation: the output of nvidia-smi nvlink -c / -p looks strange to me:

nvidia-smi nvlink -c

GPU 0: NVIDIA RTX A4500 (UUID: GPU-794fc296-8027-c900-183f-29e9774fb44a)
GPU 1: NVIDIA RTX A4500 (UUID: GPU-55727cbb-2894-ced1-c32f-750d8b95c1e2)

nvidia-smi nvlink -p

GPU 0: NVIDIA RTX A4500 (UUID: GPU-794fc296-8027-c900-183f-29e9774fb44a)
GPU 1: NVIDIA RTX A4500 (UUID: GPU-55727cbb-2894-ced1-c32f-750d8b95c1e2)

I already tried to adapt the solution found here (Multi-GPU Peer to Peer access failing on Tesla K80 - #15 by Robert_Crovella) (the ACS settings), but without success. If you think this is the issue, I can retry with any other commands you provide.

Of course, feel free to ask for any additional information that could help.

Thank you in advance.

Check the motherboard BIOS to see if there is a setting for IOMMU or VT-d. If so, try disabling it. From here:

PCI Access Control Services (ACS)

IO virtualization (also known as VT-d or IOMMU) can interfere with GPUDirect by redirecting all PCI point-to-point traffic to the CPU root complex, causing a significant performance reduction or even a hang. You can check whether ACS is enabled on PCI bridges by running:

sudo lspci -vvv | grep ACSCtl

If lines show “SrcValid+”, then ACS might be enabled. Looking at the full output of lspci, one can check if a PCI bridge has ACS enabled.

sudo lspci -vvv

If PCI switches have ACS enabled, it needs to be disabled. On some systems this can be done from the BIOS by disabling IO virtualization or VT-d.
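
If a bridge does show SrcValid+ and the BIOS offers no switch to turn IO virtualization off, a common workaround is to clear the ACS control register on that bridge with setpci. This is only a sketch: it assumes a pciutils version that recognizes the ECAP_ACS capability name, <bus:dev.fn> is a placeholder for the bridge address taken from lspci, and the change does not persist across reboots.

# Replace <bus:dev.fn> with the address of the PCIe bridge reported by lspci
sudo setpci -s <bus:dev.fn> ECAP_ACS+0x6.w=0000

# Re-check: every ACSCtl field should now read '-' (disabled)
sudo lspci -vvv | grep ACSCtl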

Hello and thank you for responding.

However, I tried this solution and it did not work. I disabled VT-d in the BIOS and here is the output of:

sudo lspci -vvv | grep ACSCtl

		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-

The output of the CUDA sample is still the same:

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX A4500, pciBusID: 4f, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA RTX A4500, pciBusID: 52, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0	     1     1
     1	     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 563.88  17.47 
     1  17.72 563.88 
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1 
     0 542.72   0.01 
     1   0.01 563.47 
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 562.76  18.95 
     1  19.33 567.66 
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 552.22   0.02 
     1   0.02 566.74 
P2P=Disabled Latency Matrix (us)
   GPU     0      1 
     0   1.56  37.53 
     1  20.55   1.51 

   CPU     0      1 
     0   2.51   7.14 
     1   6.97   2.42 
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1 
     0   1.55 155.43 
     1 155.52   1.51 

   CPU     0      1 
     0   2.46   1.99 
     1   1.94   2.42 

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

If you or anyone else has other ideas, they are welcome.

Update the system BIOS to the latest version.

Unfortunately, it seems that I already have the latest BIOS version: my BIOS reports version 1.4, built 08/23/2022, which is the same version listed here.


My issue is still there, but I may have new information that might help in understanding the problem.

After spending hours searching the internet, I came to realize that the motherboard's specifications make absolutely no mention of SLI support.

Would this mean that the motherboard is not compatible with NVLink?


I managed to solve the problem.

Edit: TL;DR: sudo nvidia-xconfig --sli="mosaic" resolved my issue after a reboot.

The long version:

First, I tested my A4500s with NVLink on another motherboard. It did not work under Ubuntu, but I managed to make it work under Windows 10 by following these instructions:

  1. On Windows, open the NVIDIA Control Panel
  2. In 3D Settings, go to Configure SLI, …
  3. Enable it by clicking on Maximise 3D Performance

At this point, I was able to run the cuda-samples on Windows and obtained good results (~50 GB/s on P2P).

Returning to Ubuntu, it worked without any further commands or setup.

However, back on the previous motherboard, which has no Windows installation, I managed to find an alternative:

You want to modify your xorg.conf file, which should be located at /etc/X11/xorg.conf.
The option you want to turn on is SLI, which you can do by editing the file directly and adding the following in the Screen section:

Option "SLI" "mosaic"

Be careful: this is for CUDA 11.7, driver version 515.86.01.
Here is the source, where you can find more information: https://download.nvidia.com/XFree86/Linux-x86_64/
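
For reference, here is a minimal sketch of what the Screen section might look like after the edit; the Identifier, Device and Monitor names below are just examples and will differ on your system:

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    Option         "SLI" "mosaic"
EndSection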

You can also just use the nvidia-xconfig command this way:

sudo nvidia-xconfig --sli="mosaic"

Once you have edited the xorg.conf file one way or another, reboot and it should work.
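
To check that the change took effect after the reboot, you can confirm that the links now report as active and rerun the bandwidth test; the P2P=Enabled numbers should be far above the 0.01 GB/s seen before:

nvidia-smi nvlink --status
./p2pBandwidthLatencyTest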
