Low P2P GPU bandwidth performance between GeForce GPUs

I used GeForce RTX 4090 graphics cards to test the CUDA samples released at https://github.com/NVIDIA/cuda-samples/. The p2pBandwidthLatencyTest result is poor: two of the values are noticeably low.
Why is P2P GPU bandwidth performance low?

4x RTX 4090, 128 GB RAM (4x 32 GB DDR5-4000)

Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 919.39  **17.79**  31.04  31.03
     1  **28.77** 923.74  31.11  31.02
     2  31.17  31.22 923.19  31.25
     3  31.18  31.19  31.31 923.46

After removing 2 devices, the performance reaches the theoretical value.
2x RTX 4090, 128 GB RAM (4x 32 GB DDR5-4000)

Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 918.58  31.24
     1  31.16 923.46

I tried increasing the RAM capacity, but the result was the same as before.
4x RTX 4090, 256 GB RAM (4x 64 GB DDR5-4000)

Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3 
     0 918.31  **20.51**  30.13  28.48 
     1  **23.52** 922.92  29.83  30.39 
     2  28.36  29.70 923.64  29.69 
     3  30.93  30.18  29.42 923.46 

How can I increase the RTX 4090’s P2P performance?
Are there any system requirements for this CUDA sample?
I look forward to your reply^^

What does the system look like? What type of CPU is used? Single socket, dual socket?

Are there four PCIe 4.0 x16 slots for the four GPUs to plug into? Does the CPU provide > 64 PCIe 4.0 lanes?

P2P is not supported on Ada desktop cards, see here.

While true P2P is not possible, there is a fall-back mode where communication goes via the host over PCIe. I interpreted the question as asking why not all GPU pairs in this system achieve the same communication throughput via PCIe when there are four GPUs.

Fair point, and that does seem to be the gist of the OP’s question. I guess I stopped at,

One observation: if the “Bidirectional” in the results means full duplex, the 31 GB/s figure would seem to indicate PCIe Gen3 performance?

Or Gen4 x8.

There is a project here adding P2P to 4090, which may be of interest.

4U system (up to 8 GPUs). 2x Intel Xeon Silver 4410Y. Dual-socket 4th Gen Xeon motherboard.

Yes, there are. Each processor provides 80 PCIe 4.0 lanes, so there are 160 lanes in total.

Got it! I understand… true P2P may require NVLink or NVSwitch support.
P2P for PCIe devices has been around for many years and has several different interpretations. The way I see it, it is like the difference between P2P Access (direct!) and P2P Copy (via the host over PCIe).

I think… it’s Gen4 x16!
Gen4 x8 → 15.8GB/s
Gen4 x16 → 31.5GB/s
from https://en.wikipedia.org/wiki/PCI_Express
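
For reference, the Gen4 x16 figure follows from the 16 GT/s per-lane signalling rate with 128b/130b encoding (a back-of-the-envelope check that ignores packet overhead):

16 GT/s x 128/130 / 8 ≈ 1.97 GB/s per lane
1.97 GB/s x 16 lanes ≈ 31.5 GB/s per direction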

Sorry, you are right regarding the PCIe speed per lane.

The slow speed seems to appear only between devices 0 and 1. Did you remove devices 2 and 3 in your second test with two GPUs?

You could change the source code of the bandwidth test so that, with 4 GPUs installed, it only works with 2 at a time, to see whether some link is over its capacity or whether installing that many GPUs leads to some reconfiguration.

Which CPU’s lanes are the four cards assigned to?

Have you tried affinity settings to let the benchmark run on one or the other CPU?

If you look at the chart you linked, note i at the bottom states “In each direction”, so if the test is bidirectional (full duplex), the throughput should be twice this.

The same test run on 4090s with the P2P-enabled driver referred to above shows a throughput of 50 GB/s.

PCIe uses packetized transport. There are discrepancies between theoretical PCIe throughput and what is practically achievable at the supported packet size (I think 256 bytes these days, but don’t quote me on that). What I have seen in practice for unidirectional traffic is

Between 12 GB/sec and 13 GB/sec for a PCIe 3.0 x16 interface
Around 25 GB/sec for a PCIe 4.0 x16 interface

To my knowledge there are no GPUs with a PCIe 5.0 interface yet. So if 50 GB/sec are reported for the RTX 4090, it stands to reason that this refers to bidirectional bandwidth, given that PCIe is a full-duplex interconnect.

As has been alluded to in posts by other participants, for best performance in a dual-socket system it is important to have each GPU “talk” to the “near” CPU and its associated memory, otherwise inter-socket communication can become a bottleneck. For this, specify processor and memory affinity with a utility like numactl, or use any controls provided by the test app itself (I have not looked at what it offers).
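
For illustration, a minimal sketch of such an affinity-pinned run, assuming GPUs 0/1 sit on NUMA node 0 and GPUs 2/3 on NUMA node 1 (which is what the nvidia-smi topo -m output later in this thread reports for this system; node numbers may differ elsewhere):

# Pair attached to CPU socket 0: pin threads and memory allocations to node 0
export CUDA_VISIBLE_DEVICES="0,1"
numactl --cpunodebind=0 --membind=0 ./p2pBandwidthLatencyTest

# Pair attached to CPU socket 1: pin threads and memory allocations to node 1
export CUDA_VISIBLE_DEVICES="2,3"
numactl --cpunodebind=1 --membind=1 ./p2pBandwidthLatencyTest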

I appreciate your reply^^. Can you provide the minimum or recommended system requirements for running p2pBandwidthLatencyTest (such as CPU, number of PCIe lanes, RAM clock and size…)?

Yes, I have. One GPU device per CPU.

Yes, I did, but the result is the same when only 2 of the 4 installed GPUs are used (command: export CUDA_VISIBLE_DEVICES="0,1"; ./p2pBandwidthLatencyTest).

Two cards under CPU0’s lanes, the other 2 cards under CPU1’s lanes.

I tried that as well, but it did not affect the result (command: taskset -c x ).

Could you please read the test logs I provided below to help clear up the doubts about P2P?

4 GPU System

$ nvidia-smi
Mon Sep 30 10:55:11 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:2A:00.0 Off |                  Off |
| 33%   31C    P8             11W /  450W |       1MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        On  |   00000000:3D:00.0 Off |                  Off |
| 33%   29C    P8             19W /  450W |       1MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090        On  |   00000000:99:00.0 Off |                  Off |
| 35%   30C    P8             18W /  450W |       1MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 4090        On  |   00000000:AB:00.0 Off |                  Off |
| 34%   31C    P8             23W /  450W |       1MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Not supported

~/Tools/cuda-samples/Samples/0_Introduction/simpleP2P$ ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 4

Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU2) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU3) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU2) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU3) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU3) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU2) : No
Two or more GPUs with Peer-to-Peer access capability are required for ./simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.

TOPO

$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    SYS     SYS     0-11,24-35      0               N/A
GPU1    NODE     X      SYS     SYS     0-11,24-35      0               N/A
GPU2    SYS     SYS      X      NODE    12-23,36-47     1               N/A
GPU3    SYS     SYS     NODE     X      12-23,36-47     1               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

128GB RAM, Xeon CPU x2

$ lsmem
RANGE                                  SIZE  STATE REMOVABLE BLOCK
0x0000000000000000-0x000000007fffffff    2G online       yes     0
0x0000000100000000-0x000000207fffffff  126G online       yes  2-64

Memory block size:         2G
Total online memory:     128G
Total offline memory:      0B
$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          52 bits physical, 57 bits virtual
  Byte Order:             Little Endian
CPU(s):                   48
  On-line CPU(s) list:    0-47
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Xeon(R) Silver 4410Y
....

p2pBandwidthLatencyTest

~/Tools/cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 4090, pciBusID: 2a, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 4090, pciBusID: 3d, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA GeForce RTX 4090, pciBusID: 99, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA GeForce RTX 4090, pciBusID: ab, pciDeviceID: 0, pciDomainID:0
Device=0 CANNOT Access Peer Device=1
Device=0 CANNOT Access Peer Device=2
Device=0 CANNOT Access Peer Device=3
Device=1 CANNOT Access Peer Device=0
Device=1 CANNOT Access Peer Device=2
Device=1 CANNOT Access Peer Device=3
Device=2 CANNOT Access Peer Device=0
Device=2 CANNOT Access Peer Device=1
Device=2 CANNOT Access Peer Device=3
Device=3 CANNOT Access Peer Device=0
Device=3 CANNOT Access Peer Device=1
Device=3 CANNOT Access Peer Device=2

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3
     0       1     0     0     0
     1       0     1     0     0
     2       0     0     1     0
     3       0     0     0     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 910.02  22.11  22.14  22.14
     1  22.10 922.37  22.11  22.14
     2  22.14  22.17 919.66  22.24
     3  22.10  22.15  22.23 920.28
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3
     0 912.13  22.08  22.22  22.18
     1  22.11 922.92  22.20  22.17
     2  22.19  22.24 921.49  22.33
     3  22.22  22.15  22.22 921.39
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 919.39  17.79  31.04  31.03
     1  28.77 923.74  31.11  31.02
     2  31.17  31.22 923.19  31.25
     3  31.18  31.19  31.31 923.46
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 918.31  17.80  31.10  31.02
     1  28.54 924.56  31.14  31.12
     2  31.19  31.17 922.65  31.32
     3  31.19  31.12  31.30 923.19
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2      3
     0   1.44  10.99  10.38  10.47
     1  10.66   1.38  10.54  10.53
     2  10.42  10.37   1.39  10.25
     3  10.38  11.24  10.64   1.42

   CPU     0      1      2      3
     0   2.79   8.29   7.49   7.64
     1   8.12   2.62   7.70   7.48
     2   7.65   7.58   2.49   7.14
     3   7.66   7.63   7.15   2.49
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3
     0   1.43  10.99  10.45  10.92
     1  10.34   1.37  10.38  12.45
     2  10.41  10.58   1.37  10.27
     3  12.68  10.39  10.27   1.42

   CPU     0      1      2      3
     0   2.73   8.10   7.47   7.56
     1   8.07   2.62   7.39   7.46
     2   7.68   7.55   2.54   7.08
     3   7.75   7.63   7.09   2.53

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
~/Tools/cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest$ export CUDA_VISIBLE_DEVICES="2,3"; ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 4090, pciBusID: 99, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 4090, pciBusID: ab, pciDeviceID: 0, pciDomainID:0
Device=0 CANNOT Access Peer Device=1
Device=1 CANNOT Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0       1     0
     1       0     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 910.55  22.27
     1  22.20 921.29
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1
     0 911.61  22.25
     1  22.19 922.37
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 917.77  30.69
     1  30.70 923.46
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 918.85  30.52
     1  30.67 923.66
P2P=Disabled Latency Matrix (us)
   GPU     0      1
     0   1.41  18.49
     1  10.31   1.39

   CPU     0      1
     0   2.42   7.18
     1   7.00   2.36
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1
     0   1.41  10.42
     1  10.24   1.39

   CPU     0      1
     0   2.55   7.03
     1   6.91   2.37

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
~/Tools/cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest$ export CUDA_VISIBLE_DEVICES="0,1"; ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 4090, pciBusID: 2a, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 4090, pciBusID: 3d, pciDeviceID: 0, pciDomainID:0
Device=0 CANNOT Access Peer Device=1
Device=1 CANNOT Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0       1     0
     1       0     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 908.96  22.10
     1  21.81 921.29
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1
     0 911.08  22.11
     1  21.85 921.83
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 917.96  20.03
     1  20.40 923.46
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 919.12  20.74
     1  21.05 923.19
P2P=Disabled Latency Matrix (us)
   GPU     0      1
     0   1.43  11.07
     1  10.99   1.42

   CPU     0      1
     0   2.65   8.16
     1   8.03   2.56
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1
     0   1.42  11.16
     1  11.07   1.42

   CPU     0      1
     0   2.64   8.02
     1   8.02   2.55

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

2 GPU System (One GPU device per CPU)

$ nvidia-smi
Mon Sep 30 09:35:23 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:2A:00.0 Off |                  Off |
| 34%   31C    P8             10W /  450W |       1MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        On  |   00000000:99:00.0 Off |                  Off |
| 33%   29C    P8             19W /  450W |       1MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:10:22_PDT_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0

Not supported

$ ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2

Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU0) : No
Two or more GPUs with Peer-to-Peer access capability are required for ./simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.

TOPO

$ nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     0-11,24-35      0               N/A
GPU1    SYS      X      12-23,36-47     1               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

128GB RAM, Xeon CPU x2

$ lsmem
RANGE                                  SIZE  STATE REMOVABLE BLOCK
0x0000000000000000-0x000000007fffffff    2G online       yes     0
0x0000000100000000-0x000000207fffffff  126G online       yes  2-64

Memory block size:         2G
Total online memory:     128G
Total offline memory:      0B
$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          52 bits physical, 57 bits virtual
  Byte Order:             Little Endian
CPU(s):                   48
  On-line CPU(s) list:    0-47
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Xeon(R) Silver 4410Y
....

p2pBandwidthLatencyTest

~/Tools/cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 4090, pciBusID: 2a, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 4090, pciBusID: 99, pciDeviceID: 0, pciDomainID:0
Device=0 CANNOT Access Peer Device=1
Device=1 CANNOT Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0       1     0
     1       0     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 911.66  21.97
     1  22.13 921.83
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1
     0 912.68  21.95
     1  22.14 922.37
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 918.81  30.41
     1  30.24 923.46
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 918.85  30.38
     1  30.29 923.74
P2P=Disabled Latency Matrix (us)
   GPU     0      1
     0   1.42  12.51
     1  11.34   1.35

   CPU     0      1
     0   2.74   7.51
     1   7.68   2.37
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1
     0   1.42  10.92
     1  10.38   1.34

   CPU     0      1
     0   2.65   7.44
     1   7.61   2.39

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

2 GPU System (Two GPU devices under CPU0’s lanes)

$ nvidia-smi
Mon Sep 30 10:25:17 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:2A:00.0 Off |                  Off |
| 34%   32C    P8             10W /  450W |       1MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        On  |   00000000:3D:00.0 Off |                  Off |
| 34%   31C    P8             18W /  450W |       1MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:10:22_PDT_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0

Not supported

$ ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2

Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU0) : No
Two or more GPUs with Peer-to-Peer access capability are required for ./simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.

TOPO

$ nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    0-11,24-35      0               N/A
GPU1    NODE     X      0-11,24-35      0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

128GB RAM, Xeon CPU x2

$ lsmem
RANGE                                  SIZE  STATE REMOVABLE BLOCK
0x0000000000000000-0x000000007fffffff    2G online       yes     0
0x0000000100000000-0x000000207fffffff  126G online       yes  2-64

Memory block size:         2G
Total online memory:     128G
Total offline memory:      0B
$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          52 bits physical, 57 bits virtual
  Byte Order:             Little Endian
CPU(s):                   48
  On-line CPU(s) list:    0-47
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Xeon(R) Silver 4410Y
....

p2pBandwidthLatencyTest

~/Tools/cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 4090, pciBusID: 2a, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 4090, pciBusID: 3d, pciDeviceID: 0, pciDomainID:0
Device=0 CANNOT Access Peer Device=1
Device=1 CANNOT Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0       1     0
     1       0     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 909.49  22.23
     1  22.16 921.29
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1
     0 912.68  22.23
     1  22.20 920.74
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 918.58  31.24
     1  31.16 923.46
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 918.58  31.25
     1  31.19 924.56
P2P=Disabled Latency Matrix (us)
   GPU     0      1
     0   1.42  18.48
     1  10.27   1.36

   CPU     0      1
     0   2.49   7.07
     1   6.89   2.35
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1
     0   1.42  10.25
     1  10.33   1.37

   CPU     0      1
     0   2.48   6.94
     1   6.97   2.34

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

2 GPU System (Two GPU devices under CPU1’s lanes)

$ nvidia-smi
Mon Sep 30 11:27:22 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:99:00.0 Off |                  Off |
| 36%   31C    P8             18W /  450W |       1MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        On  |   00000000:AB:00.0 Off |                  Off |
| 34%   31C    P8             23W /  450W |       1MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:10:22_PDT_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0

Not supported

$ ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2

Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU0) : No
Two or more GPUs with Peer-to-Peer access capability are required for ./simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.

TOPO

$ nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    12-23,36-47     1               N/A
GPU1    NODE     X      12-23,36-47     1               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

128GB RAM, Xeon CPU x2

$ lsmem
RANGE                                  SIZE  STATE REMOVABLE BLOCK
0x0000000000000000-0x000000007fffffff    2G online       yes     0
0x0000000100000000-0x000000207fffffff  126G online       yes  2-64

Memory block size:         2G
Total online memory:     128G
Total offline memory:      0B
$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          52 bits physical, 57 bits virtual
  Byte Order:             Little Endian
CPU(s):                   48
  On-line CPU(s) list:    0-47
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Xeon(R) Silver 4410Y
....

p2pBandwidthLatencyTest

~/Tools/cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 4090, pciBusID: 99, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 4090, pciBusID: ab, pciDeviceID: 0, pciDomainID:0
Device=0 CANNOT Access Peer Device=1
Device=1 CANNOT Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0       1     0
     1       0     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 909.49  22.22
     1  22.22 920.27
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1
     0 911.61  22.24
     1  22.22 921.15
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 917.84  31.17
     1  31.25 923.19
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 919.11  31.23
     1  31.28 922.92
P2P=Disabled Latency Matrix (us)
   GPU     0      1
     0   1.41  10.32
     1  10.31   1.39

   CPU     0      1
     0   2.48   7.06
     1   6.89   2.36
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1
     0   1.41  10.25
     1  18.49   1.39

   CPU     0      1
     0   2.47   7.02
     1   7.28   2.48

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

In order to discount a rogue PCIe connection, with four cards installed and while running the P2P test, what does the command:

nvidia-smi --format=csv --query-gpu=pcie.link.gen.current,pcie.link.width.current

show?

The link info defaults to lower modes when idle, so you need to run it while the cards are under load.
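
For example, one minimal way to do that with the commands already used in this thread is to keep the links busy in one terminal and poll in another:

# Terminal 1: generate sustained PCIe traffic
./p2pBandwidthLatencyTest

# Terminal 2: watch the negotiated link state while the test is running
watch -n 1 'nvidia-smi --format=csv --query-gpu=pcie.link.gen.current,pcie.link.width.current'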

Yes! All 4.0 x16 while the cards are under load.
watch -n 1 'nvidia-smi --query-gpu="pcie.link.gen.current,pcie.link.width.current" --format=csv'

Reconfirm:

$ watch -n 1 'nvidia-smi --query-gpu="pcie.link.gen.current,pcie.link.width.current" --format=csv'
Every 1.0s: nvidia-smi --query-gpu="pcie.link.gen.current,pcie.link.width.current" --format=csv                               
pcie.link.gen.current, pcie.link.width.current
4, 16
4, 16
4, 16
4, 16

A 4U system with support for up to 8 GPUs almost certainly has PCIe switches in addition to what is provided by the CPU sockets. The topo -m output suggests this as well.

I usually recommend the following in such situations:

  1. Update to the latest system BIOS provided by your server vendor. Check with the server vendor for optimal/recommended settings.
  2. Update to the latest GPU driver version. You appear to be using a 550.x version; update to at least 560.35.03 or newer.
  3. Obtain the actual server PCIe topology. The recommended way to do this is to ask your server vendor for the technical documents for this system; in many cases, the user guide or technical guide will include an actual PCIe topology diagram showing the interconnection of PCIe between sockets and switches, and between switches and slots, as well as the lane width of each connection. A quick local cross-check is sketched below.
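
As a rough local cross-check (not a substitute for the vendor’s topology diagram), something along these lines shows whether the GPUs sit behind PCIe switches and what link speed/width each device negotiated; the bus ID 2a is taken from the logs above:

# Dump the PCIe hierarchy as a tree; look for where the four GPUs attach and
# whether an intermediate switch sits between them and the CPU root ports.
lspci -tv

# Report the capability and the negotiated link speed/width for one GPU
sudo lspci -s 2a:00.0 -vv | grep -E "LnkCap:|LnkSta:"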

These are just suggestions/recommendations; I don’t know exactly what is happening here. If the updates suggested above don’t improve anything, and the topology diagram indicates no issues, then it could be a platform settings issue, which you would need to take up with the platform/server vendor.
