Hello,
I am having problems loading text-generation models across multiple GPUs. After following this issue on GitHub and a post on this forum about similar problems, I ran some tests with cuda-samples, and it looks like it might be an ACS-related problem between the GPUs.
I ran both simpleP2P and p2pBandwidthLatencyTest, and these are the results:
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 4
Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA RTX A6000 (GPU0) -> NVIDIA RTX A6000 (GPU1) : Yes
> Peer access from NVIDIA RTX A6000 (GPU0) -> NVIDIA RTX A6000 (GPU2) : Yes
> Peer access from NVIDIA RTX A6000 (GPU0) -> NVIDIA RTX A6000 (GPU3) : Yes
> Peer access from NVIDIA RTX A6000 (GPU1) -> NVIDIA RTX A6000 (GPU0) : Yes
> Peer access from NVIDIA RTX A6000 (GPU1) -> NVIDIA RTX A6000 (GPU2) : Yes
> Peer access from NVIDIA RTX A6000 (GPU1) -> NVIDIA RTX A6000 (GPU3) : Yes
> Peer access from NVIDIA RTX A6000 (GPU2) -> NVIDIA RTX A6000 (GPU0) : Yes
> Peer access from NVIDIA RTX A6000 (GPU2) -> NVIDIA RTX A6000 (GPU1) : Yes
> Peer access from NVIDIA RTX A6000 (GPU2) -> NVIDIA RTX A6000 (GPU3) : Yes
> Peer access from NVIDIA RTX A6000 (GPU3) -> NVIDIA RTX A6000 (GPU0) : Yes
> Peer access from NVIDIA RTX A6000 (GPU3) -> NVIDIA RTX A6000 (GPU1) : Yes
> Peer access from NVIDIA RTX A6000 (GPU3) -> NVIDIA RTX A6000 (GPU2) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 0.92GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Verification error @ element 0: val = nan, ref = 0.000000
Verification error @ element 1: val = nan, ref = 4.000000
Verification error @ element 2: val = nan, ref = 8.000000
Verification error @ element 3: val = nan, ref = 12.000000
Verification error @ element 4: val = nan, ref = 16.000000
Verification error @ element 5: val = nan, ref = 20.000000
Verification error @ element 6: val = nan, ref = 24.000000
Verification error @ element 7: val = nan, ref = 28.000000
Verification error @ element 8: val = nan, ref = 32.000000
Verification error @ element 9: val = nan, ref = 36.000000
Verification error @ element 10: val = nan, ref = 40.000000
Verification error @ element 11: val = nan, ref = 44.000000
Disabling peer access...
Shutting down...
Test failed!
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX A6000, pciBusID: 4f, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA RTX A6000, pciBusID: 52, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA RTX A6000, pciBusID: 56, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA RTX A6000, pciBusID: 57, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1 2 3
0 1 1 1 1
1 1 1 1 1
2 1 1 1 1
3 1 1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 673.20 15.32 16.13 16.15
1 16.13 673.78 16.15 16.15
2 16.14 16.12 673.20 16.15
3 16.15 16.13 16.15 673.20
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1 2 3
0 672.91 1.67 1.59 1.60
1 1.64 673.78 2.02 2.01
2 2.08 1.57 673.49 2.10
3 1.77 1.78 1.75 673.78
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 677.58 17.52 18.48 18.43
1 18.35 678.46 18.22 18.47
2 18.28 18.36 677.73 18.26
3 18.33 18.42 18.43 678.02
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 677.29 3.36 3.33 3.16
1 3.21 677.29 3.42 3.04
2 3.05 3.08 678.17 3.33
3 3.00 3.04 3.38 678.17
P2P=Disabled Latency Matrix (us)
GPU 0 1 2 3
0 1.56 17.11 14.82 17.70
1 13.73 1.56 16.61 14.03
2 17.25 17.92 1.58 16.34
3 14.07 13.32 15.88 1.62
CPU 0 1 2 3
0 3.30 9.75 9.53 9.60
1 9.51 3.35 9.17 8.94
2 8.98 9.12 3.07 8.94
3 8.85 8.82 8.70 3.07
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1 2 3
0 1.66 49204.78 49204.66 49204.65
1 49204.69 1.56 49204.65 49204.61
2 49204.92 49204.98 1.60 49205.09
3 49204.69 49204.67 49204.63 1.62
CPU 0 1 2 3
0 7.29 5.45 6.48 6.89
1 6.85 7.14 2.89 5.93
2 2.45 6.97 9.57 6.43
3 3.97 6.78 6.85 9.21
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
This is the output of nvidia-smi topo -m:
GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity
GPU0 X PXB PXB PXB 0-11,24-35 0
GPU1 PXB X PXB PXB 0-11,24-35 0
GPU2 PXB PXB X PIX 0-11,24-35 0
GPU3 PXB PXB PIX X 0-11,24-35 0
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
To me, all of these results suggest that the GPUs can see each other, but that something is blocking the actual data transfers between them.
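To narrow that down, my plan is to first figure out which bridges actually sit between each GPU and the root port. My understanding is that the sysfs device path encodes this chain, so something like the following should list the bridges each GPU hangs off of (bus IDs taken from the p2pBandwidthLatencyTest output above); please correct me if there is a better way:

# Walk up from each GPU to the root, printing the BDF of every bridge
# on the way (the GPU's own BDF is the last entry printed).
for gpu in 4f:00.0 52:00.0 56:00.0 57:00.0; do
    echo "=== GPU 0000:$gpu ==="
    readlink -f "/sys/bus/pci/devices/0000:$gpu" | tr '/' '\n' | grep -E '^[0-9a-f]{4}:'
done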
And this is the output of lspci -vvv | grep ACSCtl:
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
I can see that some PCIe bridges have ACS enabled (SrcValid+), but being new to this hardware-level communication, I am stuck on figuring out which bridges those are.
If someone could help me understand:
- Am I correctly interpreting the P2P latency results as "there is a bottleneck that prevents data from being transferred from one GPU to another"?
- How would I proceed to find the bridges with ACS enabled and disable it on them? To be exact, how would I find the IDs that go into the setpci command? (My current guess is in the sketch at the end of this post.)
- What side-effects might disabling ACS have?
The last question is about whether it would be safe for me to disable ACS at all, since I do not have much background here beyond the fact that it acts as a safety measure for data transfers between the GPUs.
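For reference, my current understanding (based mostly on the GitHub issue mentioned above) is that finding and disabling ACS would look roughly like the sketch below. The BB:DD.F address is just a placeholder, I have not actually run the setpci part yet, and I am assuming a pciutils version that recognizes the ECAP_ACS capability name, so please correct me if any of this is off:

# 1) Print every device header line together with its ACSCtl line,
#    so each SrcValid+ entry can be matched to a bridge address:
sudo lspci -vvv | grep -E "^[0-9a-f]+:[0-9a-f]+\.[0-9a-f]|ACSCtl"

# 2) For a bridge showing SrcValid+, clear its ACS Control register
#    (offset +0x6 in the ACS extended capability; placeholder address,
#    assumes setpci understands the ECAP_ACS name):
sudo setpci -s BB:DD.F ECAP_ACS+0x6.w=0000

# 3) Verify the bits now read SrcValid- etc.:
sudo lspci -s BB:DD.F -vvv | grep ACSCtl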