2 GPU System (One GPU device per CPU)
$ nvidia-smi
Mon Sep 30 09:35:23 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:2A:00.0 Off |                  Off |
| 34%   31C    P8             10W /  450W |       1MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        On  |   00000000:99:00.0 Off |                  Off |
| 33%   29C    P8             19W /  450W |       1MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:10:22_PDT_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0
P2P access: not supported
$ ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU0) : No
Two or more GPUs with Peer-to-Peer access capability are required for ./simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.
Topology
$ nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     0-11,24-35      0               N/A
GPU1    SYS      X      12-23,36-47     1               N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
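The legend above can be applied mechanically to the matrix. The following is a minimal Python sketch, not part of the original test run, that reads the GPU rows of `nvidia-smi topo -m` output and reports the link type between two devices. The names `parse_topo`, `link_rank`, and `LINK_RANK` are hypothetical, and the closest-to-farthest ordering is an assumption taken from the reverse of the legend order.

```python
# Link types from the legend, ordered closest to farthest. This ranking is an
# assumption taken from the reverse of the legend order.
LINK_RANK = ["NV", "PIX", "PXB", "PHB", "NODE", "SYS"]


def parse_topo(text):
    """Return {(row_gpu, col_gpu): link} for the GPU rows of a topo matrix."""
    rows = [line.split() for line in text.strip().splitlines()]
    # Header cells such as "GPU0" name devices; the bare "GPU" in
    # "GPU NUMA ID" is skipped because it has no numeric suffix.
    gpus = [name for name in rows[0] if name.startswith("GPU") and name[3:].isdigit()]
    links = {}
    for row in rows[1:]:
        if row and row[0] in gpus:
            for col, cell in zip(gpus, row[1:1 + len(gpus)]):
                links[(row[0], col)] = cell
    return links


def link_rank(link):
    """Position in LINK_RANK; NV1, NV2, ... all count as "NV".

    Only meaningful for links between different devices ("X" is not ranked).
    """
    key = "NV" if link.startswith("NV") else link
    return LINK_RANK.index(key)


sample = """
GPU0 GPU1 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X SYS 0-11,24-35 0 N/A
GPU1 SYS X 12-23,36-47 1 N/A
"""

links = parse_topo(sample)
link = links[("GPU0", "GPU1")]
print(link, "crosses the NUMA interconnect" if link == "SYS" else "stays within one NUMA node")
```

Splitting on whitespace keeps the sketch independent of column alignment, so it works on both aligned and collapsed copies of the output.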
128 GB RAM, 2x Xeon CPU
$ lsmem
RANGE SIZE STATE REMOVABLE BLOCK
0x0000000000000000-0x000000007fffffff 2G online yes 0
0x0000000100000000-0x000000207fffffff 126G online yes 2-64
Memory block size: 2G
Total online memory: 128G
Total offline memory: 0B
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Silver 4410Y
....
p2pBandwidthLatencyTest
~/Tools/cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 4090, pciBusID: 2a, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 4090, pciBusID: 99, pciDeviceID: 0, pciDomainID:0
Device=0 CANNOT Access Peer Device=1
Device=1 CANNOT Access Peer Device=0
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1
0 1 0
1 0 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 911.66 21.97
1 22.13 921.83
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1
0 912.68 21.95
1 22.14 922.37
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 918.81 30.41
1 30.24 923.46
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1
0 918.85 30.38
1 30.29 923.74
P2P=Disabled Latency Matrix (us)
GPU 0 1
0 1.42 12.51
1 11.34 1.35
CPU 0 1
0 2.74 7.51
1 7.68 2.37
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1
0 1.42 10.92
1 10.38 1.34
CPU 0 1
0 2.65 7.44
1 7.61 2.39
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
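Because peer access is unavailable, the "P2P=Enabled" runs fall back to the same memcpy path as the "P2P=Disabled" runs, which is why the two matrices are nearly identical. The Python sketch below is an illustration, not part of the original run. It parses the two unidirectional matrices above and compares the GPU-to-GPU (off-diagonal) averages. The helper names `parse_matrix` and `off_diagonal_mean` are hypothetical.

```python
def parse_matrix(block):
    """Parse a p2pBandwidthLatencyTest matrix block into rows of floats."""
    rows = []
    for line in block.strip().splitlines()[1:]:  # skip the column-header line
        cells = line.split()
        rows.append([float(c) for c in cells[1:]])  # drop the row label
    return rows


def off_diagonal_mean(matrix):
    """Average of the GPU-to-GPU entries (everything except the diagonal)."""
    vals = [v for i, row in enumerate(matrix) for j, v in enumerate(row) if i != j]
    return sum(vals) / len(vals)


# Values copied from the unidirectional matrices above (GB/s).
p2p_disabled = parse_matrix(r"""
D\D 0 1
0 911.66 21.97
1 22.13 921.83
""")
p2p_enabled = parse_matrix(r"""
D\D 0 1
0 912.68 21.95
1 22.14 922.37
""")

cross_disabled = off_diagonal_mean(p2p_disabled)
cross_enabled = off_diagonal_mean(p2p_enabled)
print(f"GPU<->GPU, P2P disabled: {cross_disabled:.2f} GB/s")
print(f"GPU<->GPU, P2P enabled:  {cross_enabled:.2f} GB/s")
print(f"difference: {abs(cross_enabled - cross_disabled):.3f} GB/s")
```

The diagonal entries (about 910 to 920 GB/s) are copies within a single GPU's memory, so they are excluded from the comparison.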
2 GPU System (Both GPUs on CPU0’s PCIe lanes)
$ nvidia-smi
Mon Sep 30 10:25:17 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:2A:00.0 Off |                  Off |
| 34%   32C    P8             10W /  450W |       1MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        On  |   00000000:3D:00.0 Off |                  Off |
| 34%   31C    P8             18W /  450W |       1MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:10:22_PDT_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0
P2P access: not supported
$ ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU0) : No
Two or more GPUs with Peer-to-Peer access capability are required for ./simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.
Topology
$ nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    0-11,24-35      0               N/A
GPU1    NODE     X      0-11,24-35      0               N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
128 GB RAM, 2x Xeon CPU
$ lsmem
RANGE SIZE STATE REMOVABLE BLOCK
0x0000000000000000-0x000000007fffffff 2G online yes 0
0x0000000100000000-0x000000207fffffff 126G online yes 2-64
Memory block size: 2G
Total online memory: 128G
Total offline memory: 0B
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Silver 4410Y
....
p2pBandwidthLatencyTest
~/Tools/cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 4090, pciBusID: 2a, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 4090, pciBusID: 3d, pciDeviceID: 0, pciDomainID:0
Device=0 CANNOT Access Peer Device=1
Device=1 CANNOT Access Peer Device=0
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1
0 1 0
1 0 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 909.49 22.23
1 22.16 921.29
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1
0 912.68 22.23
1 22.20 920.74
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 918.58 31.24
1 31.16 923.46
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1
0 918.58 31.25
1 31.19 924.56
P2P=Disabled Latency Matrix (us)
GPU 0 1
0 1.42 18.48
1 10.27 1.36
CPU 0 1
0 2.49 7.07
1 6.89 2.35
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1
0 1.42 10.25
1 10.33 1.37
CPU 0 1
0 2.48 6.94
1 6.97 2.34
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
2 GPU System (Both GPUs on CPU1’s PCIe lanes)
$ nvidia-smi
Mon Sep 30 11:27:22 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:99:00.0 Off |                  Off |
| 36%   31C    P8             18W /  450W |       1MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        On  |   00000000:AB:00.0 Off |                  Off |
| 34%   31C    P8             23W /  450W |       1MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:10:22_PDT_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0
P2P access: not supported
$ ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU0) : No
Two or more GPUs with Peer-to-Peer access capability are required for ./simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.
Topology
$ nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    12-23,36-47     1               N/A
GPU1    NODE     X      12-23,36-47     1               N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
128 GB RAM, 2x Xeon CPU
$ lsmem
RANGE SIZE STATE REMOVABLE BLOCK
0x0000000000000000-0x000000007fffffff 2G online yes 0
0x0000000100000000-0x000000207fffffff 126G online yes 2-64
Memory block size: 2G
Total online memory: 128G
Total offline memory: 0B
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Silver 4410Y
....
p2pBandwidthLatencyTest
~/Tools/cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 4090, pciBusID: 99, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 4090, pciBusID: ab, pciDeviceID: 0, pciDomainID:0
Device=0 CANNOT Access Peer Device=1
Device=1 CANNOT Access Peer Device=0
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1
0 1 0
1 0 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 909.49 22.22
1 22.22 920.27
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1
0 911.61 22.24
1 22.22 921.15
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 917.84 31.17
1 31.25 923.19
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1
0 919.11 31.23
1 31.28 922.92
P2P=Disabled Latency Matrix (us)
GPU 0 1
0 1.41 10.32
1 10.31 1.39
CPU 0 1
0 2.48 7.06
1 6.89 2.36
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1
0 1.41 10.25
1 18.49 1.39
CPU 0 1
0 2.47 7.02
1 7.28 2.48
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
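To put the three configurations side by side, the Python sketch below collects the GPU-to-GPU (off-diagonal) values from the "P2P=Disabled" bandwidth matrices of each run and reports the gain of the two single-socket (NODE) placements over the one-GPU-per-CPU (SYS) placement. It is an illustrative summary, not part of the original runs. The dictionary labels and the helper `gain_over_baseline` are hypothetical, and since the sample itself warns that it is not meant for performance measurement, small differences should be read with caution.

```python
# GPU<->GPU bandwidth in GB/s, copied from the P2P=Disabled matrices of each run.
# Each tuple is (GPU0 -> GPU1, GPU1 -> GPU0).
results = {
    "one GPU per CPU (SYS)": {"uni": (21.97, 22.13), "bidi": (30.41, 30.24)},
    "both on CPU0 (NODE)": {"uni": (22.23, 22.16), "bidi": (31.24, 31.16)},
    "both on CPU1 (NODE)": {"uni": (22.22, 22.22), "bidi": (31.17, 31.25)},
}
BASELINE = "one GPU per CPU (SYS)"


def mean(pair):
    return sum(pair) / len(pair)


def gain_over_baseline(data, baseline, kind):
    """Percent change of each configuration's mean bandwidth versus the baseline."""
    base = mean(data[baseline][kind])
    return {
        name: (mean(values[kind]) / base - 1.0) * 100.0
        for name, values in data.items()
        if name != baseline
    }


uni_gain = gain_over_baseline(results, BASELINE, "uni")
bidi_gain = gain_over_baseline(results, BASELINE, "bidi")

for name in uni_gain:
    print(f"{name}: unidirectional {uni_gain[name]:+.2f}%, "
          f"bidirectional {bidi_gain[name]:+.2f}%")
```

On these numbers, placing both GPUs on one socket changes unidirectional bandwidth by less than 1% and bidirectional bandwidth by roughly 3%, which is consistent with all three runs using the non-P2P memcpy fallback.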