Can I use Quadro K4000 and K2000 with GPUDirect v2 Peer-to-peer (P2P) communictation?

Can I use Quadro K4000 and K2000 for GPUDirect v2 Peer-to-peer (P2P) communication?

I use:

  • Single CPU (Intel Core i7-4820K Ivy Bridge-E) 40 Lanes of PCIe 3.0 + MotherBoard MSI X79A-GD65 (8D)
  • WindowsServer 2012, MSVS 2012 + CUDA 5.5 and compiled as 64-bit application
  • GPUs nVidia Quadro K4000 and K2000
  • All Quadros in TCC-mode (Tesla Compute Cluster)
  • nVidia Video Driver 332.50
  • Switch off VT-d in BIOS

simpleP2P-test shown that, all Quadros K4000 and K4000 - IS capable of Peer-to-Peer (P2P), but Peer-to-Peer (P2P) access - Quadro K4000 (GPU0) <-> Quadro K2000 (GPU1) : No.

[C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.5

[C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.5\0_Simple\simpleP2P…/…/bi n/win64/Release/simpleP2P.exe] - Starting… Checking for multiple GPUs… CUDA-capable device count: 3

GPU0 = " Quadro K4000" IS capable of Peer-to-Peer (P2P)
GPU1 = " Quadro K2000" IS capable of Peer-to-Peer (P2P)
GPU2 = " GeForce GT 640" NOT capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access…
Peer-to-Peer (P2P) access from Quadro K4000 (GPU0) → Quadro K2000 (GPU1) : No
Peer-to-Peer (P2P) access from Quadro K2000 (GPU1) → Quadro K4000 (GPU0) : No

Two or more SM 2.0 class GPUs are required for C:\ProgramData\NVIDIA Corporation \CUDA Samples\v5.5\0_Simple\simpleP2P…/…/bin/win64/Release/simpleP2P.exe to r un. Support for UVA requires a GPU with SM 2.0 capabilities. Peer to Peer access is not available between GPU0 <-> GPU1, waiving test.

_Simple\simpleP2P../../bi n/win64/Release/simpleP2P.exe] - Starting... Checking for multiple GPUs... CUDA-capable device count: 3

GPU0 = " Quadro K4000" IS capable of Peer-to-Peer (P2P)
GPU1 = " Quadro K2000" IS capable of Peer-to-Peer (P2P)
GPU2 = " GeForce GT 640" NOT capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access...
Peer-to-Peer (P2P) access from Quadro K4000 (GPU0) -> Quadro K2000 (GPU1) : No
Peer-to-Peer (P2P) access from Quadro K2000 (GPU1) -> Quadro K4000 (GPU0) : No

Two or more SM 2.0 class GPUs are required for C:\ProgramData\NVIDIA Corporation \CUDA Samples\v5.5

[C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.5\0_Simple\simpleP2P…/…/bi n/win64/Release/simpleP2P.exe] - Starting… Checking for multiple GPUs… CUDA-capable device count: 3

GPU0 = " Quadro K4000" IS capable of Peer-to-Peer (P2P)
GPU1 = " Quadro K2000" IS capable of Peer-to-Peer (P2P)
GPU2 = " GeForce GT 640" NOT capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access…
Peer-to-Peer (P2P) access from Quadro K4000 (GPU0) → Quadro K2000 (GPU1) : No
Peer-to-Peer (P2P) access from Quadro K2000 (GPU1) → Quadro K4000 (GPU0) : No

Two or more SM 2.0 class GPUs are required for C:\ProgramData\NVIDIA Corporation \CUDA Samples\v5.5\0_Simple\simpleP2P…/…/bin/win64/Release/simpleP2P.exe to r un. Support for UVA requires a GPU with SM 2.0 capabilities. Peer to Peer access is not available between GPU0 <-> GPU1, waiving test.

_Simple\simpleP2P../../bin/win64/Release/simpleP2P.exe to r un. Support for UVA requires a GPU with SM 2.0 capabilities. Peer to Peer access is not available between GPU0 <-> GPU1, waiving test.

Quadros in TCC-mode:

nvidia-smi.exe"
Tue Mar 11 12:43:05 2014
+------------------------------------------------------+
| NVIDIA-SMI 5.320.57   Driver Version: 320.57         |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro K2000        TCC  | 0000:01:00.0     Off |                  N/A |
| 30%   30C    P8    N/A /  N/A |        6MB /  2047MB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GT 640     WDDM  | 0000:02:00.0     N/A |                  N/A |
| 40%   32C  N/A     N/A /  N/A |     2016MB /  2047MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   2  Quadro K4000        TCC  | 0000:03:00.0     Off |                  N/A |
| 30%   36C    P8    10W /  87W |        8MB /  3071MB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    1            Not Supported                                               |

In the documentation said that: https://developer.nvidia.com/gpudirect
GPUDirect eliminates unnecessary system memory copies, dramatically lowers CPU overhead, and reduces latency, resulting in significant performance improvements in data transfer times for applications running on NVIDIA Tesla™ and Quadro™ products.”

More detailed specifications of Quadros there, but there are only about GPUDirect For Video, and nothing about P2P: http://www.nvidia.com/content/PDF/line_card/6660-nv-prographicssolutions-linecard-july13-final-lr.pdf

About PCIe bus:

nvidia-smi -q
GPU 0000:01:00.0
    Product Name                    : Quadro K2000
    PCI
        Bus                         : 0x01
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x0FFE10DE
        Bus Id                      : 0000:01:00.0
        Sub System Id               : 0x094C10DE
        GPU Link Info
            PCIe Generation
                Max                 : 2
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 8x
    FB Memory Usage
        Total                       : 2047 MiB
        Used                        : 6 MiB
        Free                        : 2041 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 2 MiB
        Free                        : 254 MiB
    Compute Mode                    : Default
...

GPU 0000:02:00.0
    Product Name                    : GeForce GT 640
    PCI
        Bus                         : 0x02
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x0FC110DE
        Bus Id                      : 0000:02:00.0
        Sub System Id               : 0x8A921462
        GPU Link Info
            PCIe Generation
                Max                 : N/A
                Current             : N/A
            Link Width
                Max                 : N/A
                Current             : N/A

...

GPU 0000:03:00.0
    Product Name                    : Quadro K4000
    PCI
        Bus                         : 0x03
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x11FA10DE
        Bus Id                      : 0000:03:00.0
        Sub System Id               : 0x097C10DE
        GPU Link Info
            PCIe Generation
                Max                 : 2
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
    FB Memory Usage
        Total                       : 3071 MiB
        Used                        : 8 MiB
        Free                        : 3063 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 2 MiB
        Free                        : 254 MiB
    Compute Mode                    : Default

Can I use GPUDirect v2 P2P with Quadros K2000/K4000, and if I can, then with which of these?
I use single CPU, so there should be one IOH with single PCIe-tree, isn’t it?

  1. I can't use P2P Direct Transfers - I transfered random generated data by using cudaMemcpy(gpu_ptr1, gpu_ptr0, cudaMemcpyDefault); successfully with 3 GB/sec on PCIe-gen2 8x (4 GB/sec theoretically), but function copies through the host - In VisualProfiler Context1(DtoH) and Context2(HtoD).
  2. I can't use P2P Direct Access by using __global__ Kernel(char *dst, char *src, size_t size) { int idx = blockIdx.x * blockDim.x + threadIdx.x; dst[idx] = src[idx]; } - I get an error when use function cudaDeviceEnablePeerAccess() and get 0 when using cudaDeviceCanAccessPeer()