Multi-GPU model inference failing with 4 A6000s

Hello,
I am having problems loading text-generation models across multiple GPUs. After following an issue on GitHub and a post on this forum about similar problems, I ran some tests with cuda-samples, and it looks like it might be an ACS-related problem between the GPUs.

I have run both simpleP2P and p2pBandwidthLatencyTest, and these are the results:

[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 4

Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA RTX A6000 (GPU0) -> NVIDIA RTX A6000 (GPU1) : Yes
> Peer access from NVIDIA RTX A6000 (GPU0) -> NVIDIA RTX A6000 (GPU2) : Yes
> Peer access from NVIDIA RTX A6000 (GPU0) -> NVIDIA RTX A6000 (GPU3) : Yes
> Peer access from NVIDIA RTX A6000 (GPU1) -> NVIDIA RTX A6000 (GPU0) : Yes
> Peer access from NVIDIA RTX A6000 (GPU1) -> NVIDIA RTX A6000 (GPU2) : Yes
> Peer access from NVIDIA RTX A6000 (GPU1) -> NVIDIA RTX A6000 (GPU3) : Yes
> Peer access from NVIDIA RTX A6000 (GPU2) -> NVIDIA RTX A6000 (GPU0) : Yes
> Peer access from NVIDIA RTX A6000 (GPU2) -> NVIDIA RTX A6000 (GPU1) : Yes
> Peer access from NVIDIA RTX A6000 (GPU2) -> NVIDIA RTX A6000 (GPU3) : Yes
> Peer access from NVIDIA RTX A6000 (GPU3) -> NVIDIA RTX A6000 (GPU0) : Yes
> Peer access from NVIDIA RTX A6000 (GPU3) -> NVIDIA RTX A6000 (GPU1) : Yes
> Peer access from NVIDIA RTX A6000 (GPU3) -> NVIDIA RTX A6000 (GPU2) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 0.92GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Verification error @ element 0: val = nan, ref = 0.000000
Verification error @ element 1: val = nan, ref = 4.000000
Verification error @ element 2: val = nan, ref = 8.000000
Verification error @ element 3: val = nan, ref = 12.000000
Verification error @ element 4: val = nan, ref = 16.000000
Verification error @ element 5: val = nan, ref = 20.000000
Verification error @ element 6: val = nan, ref = 24.000000
Verification error @ element 7: val = nan, ref = 28.000000
Verification error @ element 8: val = nan, ref = 32.000000
Verification error @ element 9: val = nan, ref = 36.000000
Verification error @ element 10: val = nan, ref = 40.000000
Verification error @ element 11: val = nan, ref = 44.000000
Disabling peer access...
Shutting down...
Test failed!
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX A6000, pciBusID: 4f, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA RTX A6000, pciBusID: 52, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA RTX A6000, pciBusID: 56, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA RTX A6000, pciBusID: 57, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3
     0       1     1     1     1
     1       1     1     1     1
     2       1     1     1     1
     3       1     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 673.20  15.32  16.13  16.15
     1  16.13 673.78  16.15  16.15
     2  16.14  16.12 673.20  16.15
     3  16.15  16.13  16.15 673.20
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3
     0 672.91   1.67   1.59   1.60
     1   1.64 673.78   2.02   2.01
     2   2.08   1.57 673.49   2.10
     3   1.77   1.78   1.75 673.78
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 677.58  17.52  18.48  18.43
     1  18.35 678.46  18.22  18.47
     2  18.28  18.36 677.73  18.26
     3  18.33  18.42  18.43 678.02
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 677.29   3.36   3.33   3.16
     1   3.21 677.29   3.42   3.04
     2   3.05   3.08 678.17   3.33
     3   3.00   3.04   3.38 678.17
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2      3
     0   1.56  17.11  14.82  17.70
     1  13.73   1.56  16.61  14.03
     2  17.25  17.92   1.58  16.34
     3  14.07  13.32  15.88   1.62

   CPU     0      1      2      3
     0   3.30   9.75   9.53   9.60
     1   9.51   3.35   9.17   8.94
     2   8.98   9.12   3.07   8.94
     3   8.85   8.82   8.70   3.07
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3
     0   1.66 49204.78 49204.66 49204.65
     1 49204.69   1.56 49204.65 49204.61
     2 49204.92 49204.98   1.60 49205.09
     3 49204.69 49204.67 49204.63   1.62

   CPU     0      1      2      3
     0   7.29   5.45   6.48   6.89
     1   6.85   7.14   2.89   5.93
     2   2.45   6.97   9.57   6.43
     3   3.97   6.78   6.85   9.21

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

This is the output of nvidia-smi topo -m:

        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      PXB     PXB     PXB     0-11,24-35      0
GPU1    PXB      X      PXB     PXB     0-11,24-35      0
GPU2    PXB     PXB      X      PIX     0-11,24-35      0
GPU3    PXB     PXB     PIX      X      0-11,24-35      0

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

To me, all of these results suggest that there are connections between the GPUs, but something is blocking the actual transfer of data.

And this is the output of lspci -vvv | grep ACSCtl:

                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-

I see that some of the PCI bridges have ACS enabled (SrcValid+), but being new to the hardware-communication side of things, I am stuck trying to work out which bridges those entries belong to.

If someone could help me understand:

  1. Am I correctly interpreting the p2p latency results as “there is a bottleneck that prevents data from being transferred from one GPU to another”?
  2. How would I go about finding the bridges with ACS enabled so that I can disable it? To be exact, how would I find the IDs that go into the setpci command?
  3. What side-effects might disabling ACS have?

That last question is really about whether it would be safe for me to disable ACS, since I do not have much background here beyond knowing that it acts as a safety measure for data transfers between the GPUs.

The way I would view things here is that the failed simpleP2P verification, the P2P=Enabled bandwidth collapsing to a couple of GB/s, and the ~49,000 us P2P=Enabled latencies all indicate a broken setup. If you want to call it a “bottleneck”, feel free. But it is broken.

lspci is the typical tool used. I won’t be able to give a tutorial here, but you can find various writeups, including responses I have made on these forums. The -vvv switch and the -t (tree view) switch are the relevant ones; lspci has a man page as well as command-line help.
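
As a rough sketch of the kind of thing those writeups walk through (run as root, assumes pciutils is installed; the grep pattern simply matches the SrcValid+ lines you already pasted, and nothing here is specific to your board):

for BDF in $(lspci | awk '{print $1}'); do
    # Report every PCI function whose ACS Control register shows source
    # validation enabled; these are the candidates for a setpci change.
    if lspci -s "$BDF" -vvv 2>/dev/null | grep -q 'ACSCtl:.*SrcValid+'; then
        echo "ACS enabled on $BDF"
    fi
done

lspci -tv then shows the bridge hierarchy, so you can see which of those functions actually sit on the path between your GPUs at 4f:00.0, 52:00.0, 56:00.0 and 57:00.0; those bus/device/function IDs are what go into setpci.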

You’ve made a few statements here, such as “I do not have much background on this”, that lead me to suggest you may be better off 1. updating the system BIOS on your motherboard and 2. taking this issue up with the motherboard vendor.

Even if you educate yourself and learn how to make changes to these PCI settings, they will not persist through a reboot, and there is nothing you can do to change that, other than repeating your changes every time the system boots. If you need changes to these settings, the right way is via the system BIOS.
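
For completeness, if you did go the setpci route anyway, the usual workaround is a script that clears the ACS Control register on every function that exposes the capability, re-run from some boot-time mechanism. This is only a sketch (ECAP_ACS+0x6.w addresses the ACS Control register at its standard offset and needs a reasonably recent pciutils), and it is exactly the sort of repetition a proper BIOS option makes unnecessary:

# Must run as root, and must be re-run after every reboot.
for BDF in $(lspci | awk '{print $1}'); do
    # Skip functions that do not expose the ACS extended capability at all.
    setpci -s "$BDF" ECAP_ACS+0x6.w >/dev/null 2>&1 || continue
    # Clear the ACS Control register (SrcValid, ReqRedir, CmpltRedir, UpstreamFwd, ...).
    setpci -s "$BDF" ECAP_ACS+0x6.w=0000
done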

This and this may be of interest.