I am having difficulty establishing a P2P connection between two Tesla K40c devices on the same PCIe link, and I do not know what the problem could be. Here is some information about my machine:
OS: Windows 10 x64
CPU: 2 x Intel Xeon Gold 6136 @ 3.0 GHz
RAM: 392 GB
Motherboard: HP 81C7
BIOS version: v02.47 (up to date as of 7/May/2020)
NVIDIA driver version: 441.22
CUDA version: Release 10.2 (V10.2.89)
GPUs: 3 (see below)
Here is the deviceQuery result:
Device 0: "Tesla K40c"
CUDA Driver Version / Runtime Version 10.2 / 10.2
CUDA Device Driver Mode (TCC or WDDM): TCC
Device PCI Domain ID / Bus ID / location ID: 0 / 21 / 0
Device 1: "Tesla K40c"
CUDA Driver Version / Runtime Version 10.2 / 10.2
CUDA Device Driver Mode (TCC or WDDM): TCC
Device PCI Domain ID / Bus ID / location ID: 0 / 45 / 0
Device 2: "Quadro P400"
CUDA Driver Version / Runtime Version 10.2 / 10.2
CUDA Device Driver Mode (TCC or WDDM): WDDM
Device PCI Domain ID / Bus ID / location ID: 0 / 153 / 0
The Quadro is used for display purposes and is made invisible to my CUDA programs via set CUDA_VISIBLE_DEVICES=0,1. When I run the simpleP2P example, it runs successfully, but the results indicate there is a P2P issue: the memcpy speed is only about 0.2 GB/s, even though the Tesla devices are installed in two PCIe 3.0 x16 slots attached to CPU0:
Checking for multiple GPUs...
CUDA-capable device count: 2
Checking GPU(s) for support of peer to peer memory access...
> Peer access from Tesla K40c (GPU0) -> Tesla K40c (GPU1) : Yes
> Peer access from Tesla K40c (GPU1) -> Tesla K40c (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 0.19GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Shutting down...
Test passed
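For completeness, I also confirmed that with CUDA_VISIBLE_DEVICES set, only the two Teslas are enumerated. A quick sanity check along these lines (a sketch, not my exact code) prints 2 devices, both K40c:

```cuda
// Sanity check: with CUDA_VISIBLE_DEVICES=0,1 set, only the two
// Tesla K40c cards should be enumerated; the Quadro P400 is hidden.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    printf("Visible CUDA devices: %d\n", n);  // expect 2
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, i);
        printf("  %d: %s (PCI bus %d, device %d)\n",
               i, p.name, p.pciBusID, p.pciDeviceID);
    }
    return 0;
}
```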
In fact, I realized that P2P may fail completely depending on which device is active when the P2P test kernel is launched. So the problem is not just that P2P is slow; it seems to be a more fundamental issue. When I run the same test code on another machine (Ubuntu, two GTX Titan Blacks, CUDA 10.1), it works fine, which verifies that the test code itself is correct. I posted this question on Stack Overflow in the past and was advised that it is likely a system or platform issue. I have reinstalled the driver and CUDA several times, and I have also tried CUDA 10.1, but nothing changed this behavior. Can someone shed some light on how I can dig into this further and find the source of the problem?
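In case it is useful, the core of my test is essentially the following sketch (not my exact code; device IDs 0 and 1 are the two K40c cards after CUDA_VISIBLE_DEVICES is applied). It enables peer access both ways, copies a 64 MB buffer from GPU0 to GPU1 with cudaMemcpyPeer, and verifies the result on the host:

```cuda
// Minimal P2P repro: enable peer access, copy GPU0 -> GPU1, verify on host.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e = (call);                                       \
        if (e != cudaSuccess) {                                       \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e));                           \
            exit(1);                                                  \
        }                                                             \
    } while (0)

int main() {
    const size_t N = 64 << 20;  // 64 MB, as in simpleP2P

    int can01 = 0, can10 = 0;
    CHECK(cudaDeviceCanAccessPeer(&can01, 0, 1));
    CHECK(cudaDeviceCanAccessPeer(&can10, 1, 0));
    printf("P2P access 0->1: %d, 1->0: %d\n", can01, can10);

    // Enable peer access in both directions.
    CHECK(cudaSetDevice(0));
    CHECK(cudaDeviceEnablePeerAccess(1, 0));
    CHECK(cudaSetDevice(1));
    CHECK(cudaDeviceEnablePeerAccess(0, 0));

    unsigned char *d0, *d1;
    CHECK(cudaSetDevice(0));
    CHECK(cudaMalloc(&d0, N));
    CHECK(cudaMemset(d0, 0xAB, N));
    CHECK(cudaSetDevice(1));
    CHECK(cudaMalloc(&d1, N));

    // The peer copy whose bandwidth and correctness I am testing.
    CHECK(cudaMemcpyPeer(d1, 1, d0, 0, N));

    unsigned char *h = (unsigned char *)malloc(N);
    CHECK(cudaMemcpy(h, d1, N, cudaMemcpyDeviceToHost));
    for (size_t i = 0; i < N; ++i) {
        if (h[i] != 0xAB) {
            printf("Mismatch at byte %zu\n", i);
            return 1;
        }
    }
    printf("P2P copy verified\n");
    free(h);
    return 0;
}
```

On the failing machine, whether this succeeds seems to depend on which device is current when the copy is issued, which is what makes me suspect something below the application level.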
Thank you for your time!