Here’s the cut-down deviceQuery sample output. It’s fairly uninteresting, because the DGX-2 architecture with NVSwitch means that every GPU can communicate directly with every other GPU. (This output is from a DGX-2H, so some of the GPU-specific details, such as clock rates, may differ slightly from your DGX-2.)
root@196d745bcfa1:/usr/local/cuda/samples/1_Utilities/deviceQuery# ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 16 CUDA Capable device(s)
Device 0: "Tesla V100-SXM3-32GB-H"
CUDA Driver Version / Runtime Version 10.1 / 10.1
CUDA Capability Major/Minor version number: 7.0
Total amount of global memory: 32480 MBytes (34058272768 bytes)
(80) Multiprocessors, ( 64) CUDA Cores/MP: 5120 CUDA Cores
GPU Max Clock rate: 1702 MHz (1.70 GHz)
Memory Clock rate: 1107 Mhz
Memory Bus Width: 4096-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 4 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 52 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
[snip]
Device 15: "Tesla V100-SXM3-32GB-H"
CUDA Driver Version / Runtime Version 10.1 / 10.1
CUDA Capability Major/Minor version number: 7.0
Total amount of global memory: 32480 MBytes (34058272768 bytes)
(80) Multiprocessors, ( 64) CUDA Cores/MP: 5120 CUDA Cores
GPU Max Clock rate: 1702 MHz (1.70 GHz)
Memory Clock rate: 1107 Mhz
Memory Bus Width: 4096-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 4 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 231 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from Tesla V100-SXM3-32GB-H (GPU0) -> Tesla V100-SXM3-32GB-H (GPU1) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU0) -> Tesla V100-SXM3-32GB-H (GPU2) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU0) -> Tesla V100-SXM3-32GB-H (GPU3) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU0) -> Tesla V100-SXM3-32GB-H (GPU4) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU0) -> Tesla V100-SXM3-32GB-H (GPU5) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU0) -> Tesla V100-SXM3-32GB-H (GPU6) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU0) -> Tesla V100-SXM3-32GB-H (GPU7) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU0) -> Tesla V100-SXM3-32GB-H (GPU8) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU0) -> Tesla V100-SXM3-32GB-H (GPU9) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU0) -> Tesla V100-SXM3-32GB-H (GPU10) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU0) -> Tesla V100-SXM3-32GB-H (GPU11) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU0) -> Tesla V100-SXM3-32GB-H (GPU12) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU0) -> Tesla V100-SXM3-32GB-H (GPU13) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU0) -> Tesla V100-SXM3-32GB-H (GPU14) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU0) -> Tesla V100-SXM3-32GB-H (GPU15) : Yes
[snip]
> Peer access from Tesla V100-SXM3-32GB-H (GPU15) -> Tesla V100-SXM3-32GB-H (GPU0) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU15) -> Tesla V100-SXM3-32GB-H (GPU1) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU15) -> Tesla V100-SXM3-32GB-H (GPU2) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU15) -> Tesla V100-SXM3-32GB-H (GPU3) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU15) -> Tesla V100-SXM3-32GB-H (GPU4) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU15) -> Tesla V100-SXM3-32GB-H (GPU5) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU15) -> Tesla V100-SXM3-32GB-H (GPU6) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU15) -> Tesla V100-SXM3-32GB-H (GPU7) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU15) -> Tesla V100-SXM3-32GB-H (GPU8) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU15) -> Tesla V100-SXM3-32GB-H (GPU9) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU15) -> Tesla V100-SXM3-32GB-H (GPU10) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU15) -> Tesla V100-SXM3-32GB-H (GPU11) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU15) -> Tesla V100-SXM3-32GB-H (GPU12) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU15) -> Tesla V100-SXM3-32GB-H (GPU13) : Yes
> Peer access from Tesla V100-SXM3-32GB-H (GPU15) -> Tesla V100-SXM3-32GB-H (GPU14) : Yes
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 16
Result = PASS
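The "Peer access ... : Yes" lines above come from the CUDA runtime's peer-access query. A minimal sketch of the same check, assuming a CUDA toolkit and at least two visible GPUs (error handling omitted for brevity):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("Detected %d CUDA capable device(s)\n", count);

    // Ask the runtime, for every ordered pair of GPUs, whether device i
    // can directly access device j's memory (1 = yes, 0 = no). On a
    // DGX-2, every pair should report Yes thanks to NVSwitch.
    for (int i = 0; i < count; ++i) {
        for (int j = 0; j < count; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            printf("GPU%d -> GPU%d : %s\n", i, j, canAccess ? "Yes" : "No");
        }
    }
    return 0;
}
```

On machines without NVLink/NVSwitch, some pairs can report No (for example, GPUs on different PCIe root complexes), which is exactly what the deviceQuery peer-access section surfaces.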
What is more interesting is the topology, which shows six bonded NVLink connections (NV6) between every pair of GPUs:
# nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 GPU8 GPU9 GPU10 GPU11 GPU12 GPU13 GPU14 GPU15 mlx5_0 mlx5_1 mlx5_2 mlx5_3 mlx5_4 mlx5_5 mlx5_6 mlx5_7 mlx5_8 mlx5_9 CPU Affinity
GPU0 X NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 PIX PXB NODE NODE SYS SYS SYS SYS SYS SYS 0-23,48-71
GPU1 NV6 X NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 PIX PXB NODE NODE SYS SYS SYS SYS SYS SYS 0-23,48-71
GPU2 NV6 NV6 X NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 PXB PIX NODE NODE SYS SYS SYS SYS SYS SYS 0-23,48-71
GPU3 NV6 NV6 NV6 X NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 PXB PIX NODE NODE SYS SYS SYS SYS SYS SYS 0-23,48-71
GPU4 NV6 NV6 NV6 NV6 X NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NODE NODE PIX PXB SYS SYS SYS SYS SYS SYS 0-23,48-71
GPU5 NV6 NV6 NV6 NV6 NV6 X NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NODE NODE PIX PXB SYS SYS SYS SYS SYS SYS 0-23,48-71
GPU6 NV6 NV6 NV6 NV6 NV6 NV6 X NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NODE NODE PXB PIX SYS SYS SYS SYS SYS SYS 0-23,48-71
GPU7 NV6 NV6 NV6 NV6 NV6 NV6 NV6 X NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NODE NODE PXB PIX SYS SYS SYS SYS SYS SYS 0-23,48-71
GPU8 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 X NV6 NV6 NV6 NV6 NV6 NV6 NV6 SYS SYS SYS SYS NODE NODE PIX PXB NODE NODE 24-47,72-95
GPU9 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 X NV6 NV6 NV6 NV6 NV6 NV6 SYS SYS SYS SYS NODE NODE PIX PXB NODE NODE 24-47,72-95
GPU10 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 X NV6 NV6 NV6 NV6 NV6 SYS SYS SYS SYS NODE NODE PXB PIX NODE NODE 24-47,72-95
GPU11 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 X NV6 NV6 NV6 NV6 SYS SYS SYS SYS NODE NODE PXB PIX NODE NODE 24-47,72-95
GPU12 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 X NV6 NV6 NV6 SYS SYS SYS SYS NODE NODE NODE NODE PIX PXB 24-47,72-95
GPU13 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 X NV6 NV6 SYS SYS SYS SYS NODE NODE NODE NODE PIX PXB 24-47,72-95
GPU14 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 X NV6 SYS SYS SYS SYS NODE NODE NODE NODE PXB PIX 24-47,72-95
GPU15 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 NV6 X SYS SYS SYS SYS NODE NODE NODE NODE PXB PIX 24-47,72-95
mlx5_0 PIX PIX PXB PXB NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS SYS SYS X PXB NODE NODE SYS SYS SYS SYS SYS SYS
mlx5_1 PXB PXB PIX PIX NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS SYS SYS PXB X NODE NODE SYS SYS SYS SYS SYS SYS
mlx5_2 NODE NODE NODE NODE PIX PIX PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS NODE NODE X PXB SYS SYS SYS SYS SYS SYS
mlx5_3 NODE NODE NODE NODE PXB PXB PIX PIX SYS SYS SYS SYS SYS SYS SYS SYS NODE NODE PXB X SYS SYS SYS SYS SYS SYS
mlx5_4 SYS SYS SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE NODE NODE NODE NODE SYS SYS SYS SYS X PIX NODE NODE NODE NODE
mlx5_5 SYS SYS SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE NODE NODE NODE NODE SYS SYS SYS SYS PIX X NODE NODE NODE NODE
mlx5_6 SYS SYS SYS SYS SYS SYS SYS SYS PIX PIX PXB PXB NODE NODE NODE NODE SYS SYS SYS SYS NODE NODE X PXB NODE NODE
mlx5_7 SYS SYS SYS SYS SYS SYS SYS SYS PXB PXB PIX PIX NODE NODE NODE NODE SYS SYS SYS SYS NODE NODE PXB X NODE NODE
mlx5_8 SYS SYS SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PXB PXB SYS SYS SYS SYS NODE NODE NODE NODE X PXB
mlx5_9 SYS SYS SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE PXB PXB PIX PIX SYS SYS SYS SYS NODE NODE NODE NODE PXB X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks
All of that means you end up with results like these (from the p2pBandwidthLatencyTest sample):
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 916.96 263.93 265.66 265.20 265.38 268.34 265.47 266.01 263.52 262.59 265.99 265.99 265.64 265.64 267.72 268.18
1 263.85 922.37 267.57 266.68 264.57 267.45 266.55 264.92 263.30 263.84 267.20 266.91 266.34 264.91 266.93 266.91
2 265.11 265.20 922.37 267.80 265.29 269.30 266.19 269.50 264.87 264.63 268.37 268.09 265.64 267.33 269.66 268.93
3 265.47 264.39 267.47 920.20 265.92 268.20 266.37 266.73 263.84 263.66 266.72 266.72 266.54 266.72 267.63 267.64
4 265.28 264.48 267.60 265.65 916.96 269.04 265.47 267.30 264.63 264.70 267.98 265.63 265.45 264.91 268.13 268.21
5 266.55 265.46 269.87 268.75 267.09 924.56 266.55 268.94 263.47 265.89 269.48 268.82 265.81 266.35 269.11 269.11
6 266.01 264.39 267.28 266.65 264.39 267.10 924.56 265.65 264.01 265.14 267.09 266.90 265.82 266.72 266.90 266.72
7 265.28 264.57 269.50 265.47 265.29 268.57 265.57 915.89 264.19 264.73 268.56 268.74 266.54 267.16 267.45 268.19
8 263.63 263.53 266.84 266.34 263.56 266.90 266.18 266.30 922.37 264.03 266.78 266.97 266.04 264.75 266.56 267.11
9 264.18 264.00 266.87 264.52 264.72 267.80 266.34 266.91 264.03 923.46 266.54 264.03 264.75 264.57 265.86 267.28
10 266.59 264.89 268.16 265.41 265.44 269.47 267.46 268.90 265.29 264.67 925.65 267.56 266.37 267.10 269.12 268.94
11 266.04 264.54 267.62 266.96 265.08 268.36 267.57 268.53 265.28 263.85 268.02 924.56 268.00 265.29 268.94 268.94
12 266.10 263.82 267.07 266.71 265.86 267.25 265.44 266.16 264.03 264.38 267.83 267.75 924.56 266.01 266.55 266.73
13 265.80 264.18 267.62 266.71 265.03 267.44 266.89 267.07 264.21 263.85 267.10 266.55 266.09 922.37 268.64 268.97
14 265.61 265.02 268.92 266.52 265.98 269.09 265.62 267.75 264.92 263.32 269.12 267.10 265.47 266.28 923.46 268.76
15 265.98 265.04 269.27 266.71 266.16 268.80 265.44 267.82 264.75 265.29 269.05 268.94 265.83 266.92 268.74 913.74
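Those numbers line up with the hardware: six NVLink 2.0 links at 25 GB/s per direction gives 6 × 25 × 2 = 300 GB/s theoretical bidirectional bandwidth per GPU pair, so ~265 GB/s measured is close to 90% of peak. The core of what p2pBandwidthLatencyTest does can be sketched roughly like this — a unidirectional version for a single pair, with an arbitrary buffer size and no error handling:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64ull << 20;  // 64 MiB transfer (arbitrary size)

    // Enable direct peer access in both directions between GPU 0 and GPU 1.
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);

    // One buffer on each GPU.
    void *buf0 = nullptr, *buf1 = nullptr;
    cudaSetDevice(0);
    cudaMalloc(&buf0, bytes);
    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);

    // Time a device-to-device copy; with P2P enabled on a DGX-2 this
    // travels over NVLink/NVSwitch rather than through host memory.
    cudaEvent_t start, stop;
    cudaSetDevice(0);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpyPeerAsync(buf1, 1, buf0, 0, bytes);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU0 -> GPU1: %.1f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaFree(buf0);
    cudaSetDevice(1);
    cudaFree(buf1);
    return 0;
}
```

The real sample loops this over every device pair (and runs copies in both directions simultaneously for the bidirectional matrix), which is how it fills in the 16×16 table above.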