P2P not working for P600s?

Hi,

I recently replaced two K420s with two P600s, but it appears that P2P is not working for the P600s.
However, it does work for the K420s.

I was under the impression that P2P is supposed to work for identical cards, even GeForce cards. Has this policy changed?

Here is the output from simpleP2P from the NVIDIA samples:

[root@metty simpleP2P]# ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 3
> GPU0 = "GeForce GTX 1050" IS  capable of Peer-to-Peer (P2P)
> GPU1 = "    Quadro P600" IS  capable of Peer-to-Peer (P2P)
> GPU2 = "    Quadro P600" IS  capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access...
> Peer access from GeForce GTX 1050 (GPU0) -> Quadro P600 (GPU1) : No
> Peer access from GeForce GTX 1050 (GPU0) -> Quadro P600 (GPU2) : No
> Peer access from Quadro P600 (GPU1) -> GeForce GTX 1050 (GPU0) : No
> Peer access from Quadro P600 (GPU1) -> Quadro P600 (GPU2) : No
> Peer access from Quadro P600 (GPU2) -> GeForce GTX 1050 (GPU0) : No
> Peer access from Quadro P600 (GPU2) -> Quadro P600 (GPU1) : No
Two or more GPUs with SM 2.0 or higher capability are required for ./simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.

And some nvidia-smi output:

[root@metty simpleP2P]# nvidia-smi
Tue Apr  3 13:57:59 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1050    Off  | 00000000:05:00.0 Off |                  N/A |
| 35%   40C    P0    N/A /  75W |      0MiB /  1999MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro P600         Off  | 00000000:0B:00.0 Off |                  N/A |
| 36%   50C    P0    N/A /  N/A |      0MiB /  2000MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Quadro P600         Off  | 00000000:0C:00.0 Off |                  N/A |
|  0%   67C    P0    N/A /  N/A |      0MiB /  2000MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[root@metty simpleP2P]# nvidia-smi topo -m
	GPU0	GPU1	GPU2	CPU Affinity
GPU0	 X 	PHB	PHB	0-5
GPU1	PHB	 X 	PIX	0-5
GPU2	PHB	PIX	 X 	0-5

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks
[root@metty simpleP2P]# nvidia-smi topo -p2p w
 	GPU0	GPU1	GPU2
 GPU0	X	GNS	GNS
 GPU1	GNS	X	GNS
 GPU2	GNS	GNS	X

Legend:

  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
  GNS  = GPU not supported
  TNS  = Topology not supported
  NS   = Not supported
  U    = Unknown

For the K420s, P2P works perfectly:

[root@metty p2pBandwidthLatencyTest]# nvidia-smi -L
GPU 0: GeForce GTX 1050 (UUID: GPU-578cae79-a799-351b-1b29-157171e6af0d)
GPU 1: Quadro K420 (UUID: GPU-30178a26-07b7-42a4-03bd-cf08253d89ae)
GPU 2: Quadro K420 (UUID: GPU-f81abec5-ef46-4ff7-4216-2d1786323335)
[root@metty p2pBandwidthLatencyTest]# nvidia-smi topo -m
	GPU0	GPU1	GPU2	CPU Affinity
GPU0	 X 	PHB	PHB	0-5
GPU1	PHB	 X 	PIX	0-5
GPU2	PHB	PIX	 X 	0-5

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks
[root@metty p2pBandwidthLatencyTest]# nvidia-smi topo -p2p rw
     GPU0    GPU1    GPU2
 GPU0    X    NS    NS
 GPU1    NS    X    OK
 GPU2    NS    OK    X

Legend:

  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
  GNS  = GPU not supported
  TNS  = Topology not supported
  NS   = Not supported
  U    = Unknown
[root@metty p2pBandwidthLatencyTest]# ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, GeForce GTX 1050, pciBusID: 5, pciDeviceID: 0, pciDomainID:0
Device: 1, Quadro K420, pciBusID: b, pciDeviceID: 0, pciDomainID:0
Device: 2, Quadro K420, pciBusID: c, pciDeviceID: 0, pciDomainID:0
Device=0 CANNOT Access Peer Device=1
Device=0 CANNOT Access Peer Device=2
Device=1 CANNOT Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=2 CANNOT Access Peer Device=0
Device=2 CAN Access Peer Device=1
...

I’m using Linux kernel 4.15, NVIDIA driver 390.30, and CUDA 9.1, in case that matters.
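
For reference, the check that simpleP2P performs essentially boils down to cudaDeviceCanAccessPeer(). A minimal standalone version of that check (my own sketch, not the actual sample source; the file name below is just an example) looks like this:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int src = 0; src < count; ++src) {
        for (int dst = 0; dst < count; ++dst) {
            if (src == dst) continue;
            int accessible = 0;
            // Asks the driver whether device `src` can map and access
            // memory allocated on device `dst`.
            cudaDeviceCanAccessPeer(&accessible, src, dst);
            printf("GPU%d -> GPU%d : %s\n", src, dst,
                   accessible ? "Yes" : "No");
        }
    }
    return 0;
}
```

Compiled with `nvcc p2pcheck.cu -o p2pcheck`, this should print the same Yes/No pairs as the sample does.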

EDIT: Just out of curiosity, I tried with two K420s and one P600.

[root@metty simpleP2P]# ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 3
> GPU0 = "    Quadro P600" IS  capable of Peer-to-Peer (P2P)
> GPU1 = "    Quadro K420" IS  capable of Peer-to-Peer (P2P)
> GPU2 = "    Quadro K420" IS  capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access...
> Peer access from Quadro P600 (GPU0) -> Quadro K420 (GPU1) : No
> Peer access from Quadro P600 (GPU0) -> Quadro K420 (GPU2) : No
> Peer access from Quadro K420 (GPU1) -> Quadro P600 (GPU0) : No
> Peer access from Quadro K420 (GPU1) -> Quadro K420 (GPU2) : Yes
> Peer access from Quadro K420 (GPU2) -> Quadro P600 (GPU0) : No
> Peer access from Quadro K420 (GPU2) -> Quadro K420 (GPU1) : Yes
Enabling peer access between GPU1 and GPU2...
Checking GPU1 and GPU2 for UVA capabilities...
> Quadro K420 (GPU1) supports UVA: Yes
> Quadro K420 (GPU2) supports UVA: Yes
Both GPUs can support UVA, enabling...
Allocating buffers (64MB on GPU1, GPU2 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU1 and GPU2: 5.64GB/s
Preparing host buffer and memcpy to GPU1...
Run kernel on GPU2, taking source data from GPU1 and writing to GPU2...
Run kernel on GPU1, taking source data from GPU2 and writing to GPU1...
Copy data back to host from GPU1 and verify results...
Disabling peer access...
Shutting down...
Test passed
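
For anyone reading along: once cudaDeviceCanAccessPeer() says Yes, what the sample then does amounts to enabling peer access from both sides and copying with cudaMemcpyPeer(). A rough sketch of just that part (error handling omitted; device indices 1 and 2 assume the layout above):

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 64 << 20;   // 64 MB, as in the sample
    float *buf1 = NULL, *buf2 = NULL;

    // Enable peer access in both directions (the flags argument must be 0).
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(2, 0);
    cudaSetDevice(2);
    cudaDeviceEnablePeerAccess(1, 0);

    cudaSetDevice(1);
    cudaMalloc((void **)&buf1, bytes);
    cudaSetDevice(2);
    cudaMalloc((void **)&buf2, bytes);

    // With peer access enabled, this copy goes directly over PCIe
    // without staging through host memory.
    cudaMemcpyPeer(buf2, 2, buf1, 1, bytes);
    cudaDeviceSynchronize();

    cudaFree(buf2);
    cudaSetDevice(1);
    cudaFree(buf1);
    return 0;
}
```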

What does the deviceQuery sample output look like for your P600s?

It seems that the simpleP2P test program was unable to detect GPUs with compute capability 2.0 or higher, even though the P600 is a compute capability 6.1 device. So maybe something is wrong with the P600 CUDA support in this driver?

I’ve copied the output for only one P600, but the other one is identical:

Device 0: "Quadro P600"
  CUDA Driver Version / Runtime Version          9.1 / 8.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 2000 MBytes (2097479680 bytes)
  ( 3) Multiprocessors, (128) CUDA Cores/MP:     384 CUDA Cores
  GPU Max Clock rate:                            1557 MHz (1.56 GHz)
  Memory Clock rate:                             2005 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 524288 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 12 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Looks nominal to me.

Your CUDA runtime version is 8.0. Have you considered upgrading to CUDA toolkit 9.1?

Sorry, my mistake. I have both versions installed.

Here’s the full output using the correct version. I’ve moved the GPUs around between slots, which is why they now have different BDFs (PCI bus/device/function addresses).
As you can see, the P600s still report that they cannot access each other via P2P.

[root@metty deviceQuery]# ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 3 CUDA Capable device(s)

Device 0: "Quadro P600"
  CUDA Driver Version / Runtime Version          9.1 / 9.1
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 2000 MBytes (2097479680 bytes)
  ( 3) Multiprocessors, (128) CUDA Cores/MP:     384 CUDA Cores
  GPU Max Clock rate:                            1557 MHz (1.56 GHz)
  Memory Clock rate:                             2005 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 524288 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 6 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Quadro K420"
  CUDA Driver Version / Runtime Version          9.1 / 9.1
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 2000 MBytes (2096693248 bytes)
  ( 1) Multiprocessors, (192) CUDA Cores/MP:     192 CUDA Cores
  GPU Max Clock rate:                            876 MHz (0.88 GHz)
  Memory Clock rate:                             891 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 5 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 2: "Quadro P600"
  CUDA Driver Version / Runtime Version          9.1 / 9.1
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 2000 MBytes (2097479680 bytes)
  ( 3) Multiprocessors, (128) CUDA Cores/MP:     384 CUDA Cores
  GPU Max Clock rate:                            1557 MHz (1.56 GHz)
  Memory Clock rate:                             2005 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 524288 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 12 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from Quadro P600 (GPU0) -> Quadro K420 (GPU1) : No
> Peer access from Quadro P600 (GPU0) -> Quadro P600 (GPU2) : No
> Peer access from Quadro K420 (GPU1) -> Quadro P600 (GPU0) : No
> Peer access from Quadro K420 (GPU1) -> Quadro P600 (GPU2) : No
> Peer access from Quadro P600 (GPU2) -> Quadro P600 (GPU0) : No
> Peer access from Quadro P600 (GPU2) -> Quadro K420 (GPU1) : No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 9.1, NumDevs = 3
Result = PASS

Let’s wait for the experts to chime in…

Hi again,

Based on the output from nvidia-smi, it would appear that GNS means the GPU does not support P2P at all, while NS means the GPU supports P2P, just not with that particular peer.

[root@metty ~]# nvidia-smi topo -p2p w
 	GPU0	GPU1	GPU2	GPU3
 GPU0	X	OK	NS	NS
 GPU1	OK	X	NS	NS
 GPU2	NS	NS	X	GNS
 GPU3	NS	NS	GNS	X

Legend:

  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
  GNS  = GPU not supported
  TNS  = Topology not supported
  NS   = Not supported
  U    = Unknown

According to the documentation for the NVIDIA samples [1], P2P should generally be expected to work between similar GPUs, but the phrasing is a bit unclear:

"In general, P2P is supported between two same GPUs with some exceptions, such as some Tesla and Quadro GPUs."

Does that mean that

  • P2P (generally) works between identical GeForce GPUs, but maybe not for some Quadros and Teslas?
  • or that P2P (generally) works between identical GPUs, and some Quadros and Teslas additionally support P2P between dissimilar GPUs?

I looked around in the specs for the Pascal Quadros, and it appears that P2P may actually only be supported on the higher-end models:

  • The P4000 and "above" explicitly list GPUDirect as a feature [2].
  • The P600, however, does not list GPUDirect among its features [3].

I guess this means there is no hope of getting P2P to work on the P600s, which I must admit is quite disappointing.

[1] http://docs.nvidia.com/cuda/cuda-samples/index.html#simple-peer-to-peer-transfers-with-multi-gpu
[2] https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/documents/Quadro-P4000-US-03Feb17.pdf
[3] https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/documents/Quadro-P600-US-03Feb17.pdf
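
One small consolation, in case it helps anyone: cudaMemcpyPeer() still works without peer access; the runtime just falls back to staging the copy through host memory. So code written against it remains correct on the P600s, only slower. A sketch (device indices assume the original three-GPU layout, where GPU1 and GPU2 were the P600s):

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 64 << 20;
    float *src = NULL, *dst = NULL;

    cudaSetDevice(1);                 // first P600
    cudaMalloc((void **)&src, bytes);
    cudaSetDevice(2);                 // second P600
    cudaMalloc((void **)&dst, bytes);

    // No cudaDeviceEnablePeerAccess() here: without P2P the driver
    // transparently stages this copy through pinned host memory.
    cudaMemcpyPeer(dst, 2, src, 1, bytes);
    cudaDeviceSynchronize();

    cudaFree(dst);
    cudaSetDevice(1);
    cudaFree(src);
    return 0;
}
```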

Here is some additional info about the setup, for anyone stumbling across this thread:

[root@metty ~]# nvidia-smi
Thu Apr  5 12:12:40 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro K420         Off  | 00000000:05:00.0 Off |                  N/A |
| 25%   50C    P0    N/A /  N/A |      0MiB /  1999MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro K420         Off  | 00000000:06:00.0 Off |                  N/A |
| 26%   52C    P0    N/A /  N/A |      0MiB /  1999MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Quadro P600         Off  | 00000000:09:00.0 Off |                  N/A |
| 34%   48C    P0    N/A /  N/A |      0MiB /  2000MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Quadro P600         Off  | 00000000:0A:00.0 Off |                  N/A |
|  0%   65C    P0    N/A /  N/A |      0MiB /  2000MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

[root@metty ~]# nvidia-smi topo -m
	GPU0	GPU1	GPU2	GPU3	CPU Affinity
GPU0	 X 	PIX	PHB	PHB	0-5
GPU1	PIX	 X 	PHB	PHB	0-5
GPU2	PHB	PHB	 X 	PIX	0-5
GPU3	PHB	PHB	PIX	 X 	0-5

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

EDIT: The kernel I’m running has some changes to the DMA API for PCIe peer-to-peer, so I also tried booting an older kernel (3.10.0), but the result was the same: the K420s can do P2P, while the P600s cannot.

In addition, I’ve also tried two GTX 750s; they too report GNS.

As someone who has used a number of low-end Quadros from the Fermi through the Pascal generations, this strikes me as a correct assessment.

The unfortunate part is that NVIDIA has (to my knowledge) never provided a handy table showing which Quadro models support which high-end features. One either has to dig through the various online specifications, or find out by experimenting with the actual hardware, as you have done here.

Interestingly, the K420s you used previously sit near the bottom of the Kepler-generation Quadro lineup, so it is somewhat surprising that those entry-level cards support P2P while the newer P600s do not.