P2P between two Tesla K40c devices

I am having difficulty establishing a P2P connection between two Tesla K40c devices on the same PCIe link, and I really do not know what the problem could be. Here is some information about my system:

OS: Windows 10 x64
CPU: 2 x Intel Xeon Gold 6136 @ 3.0 GHz
RAM: 392 GB
Motherboard: HP 81C7
BIOS version: v02.47 (up to date as of 7/May/2020)
NVIDIA driver version: 441.22
CUDA version: Release 10.2 (V10.2.89)
GPUs: 3 (see below)

Here is the deviceQuery result:

Device 0: "Tesla K40c"
CUDA Driver Version / Runtime Version          10.2 / 10.2
CUDA Device Driver Mode (TCC or WDDM):         TCC
Device PCI Domain ID / Bus ID / location ID:   0 / 21 / 0

Device 1: "Tesla K40c"
CUDA Driver Version / Runtime Version          10.2 / 10.2
CUDA Device Driver Mode (TCC or WDDM):         TCC
Device PCI Domain ID / Bus ID / location ID:   0 / 45 / 0

Device 2: "Quadro P400"
CUDA Driver Version / Runtime Version          10.2 / 10.2
CUDA Device Driver Mode (TCC or WDDM):         WDDM
Device PCI Domain ID / Bus ID / location ID:   0 / 153 / 0

The Quadro is used for display purposes and is hidden from my CUDA programs via set CUDA_VISIBLE_DEVICES=0,1. When I run the simpleP2P sample, it completes successfully, but the results indicate that there is a P2P issue: the memcpy speed is only about 0.2 GB/s, despite the fact that the Tesla devices are installed in two PCIe 3.0 x16 slots on CPU0:

Checking for multiple GPUs...
CUDA-capable device count: 2
Checking GPU(s) for support of peer to peer memory access...
> Peer access from Tesla K40c (GPU0) -> Tesla K40c (GPU1) : Yes
> Peer access from Tesla K40c (GPU1) -> Tesla K40c (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 0.19GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Shutting down...
Test passed
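
For anyone who wants to reproduce the measurement outside the sample, here is a stripped-down sketch of the same cudaMemcpyPeer timing loop. It assumes devices 0 and 1 are the two K40cs after CUDA_VISIBLE_DEVICES=0,1; the buffer size and repetition count just mirror what simpleP2P does:

```cpp
// Stripped-down P2P bandwidth check (a sketch, not the simpleP2P sample
// itself). Assumes devices 0 and 1 are the two K40cs after
// CUDA_VISIBLE_DEVICES=0,1. Error checking omitted for brevity.
// Build: nvcc -o p2pbw p2pbw.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    printf("peer access 0->1: %d, 1->0: %d\n", can01, can10);
    if (!can01 || !can10) return 1;

    const size_t bytes = 64 * 1024 * 1024;  // 64 MB, as in simpleP2P
    float *buf0 = nullptr, *buf1 = nullptr;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaMalloc(&buf0, bytes);

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&buf1, bytes);

    // Time 100 peer copies with events on device 0, as the sample does.
    cudaSetDevice(0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int reps = 100;
    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);  // device-to-device over PCIe
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("cudaMemcpyPeer 0->1: %.2f GB/s\n",
           (double)bytes * reps / (ms / 1000.0) / 1e9);
    return 0;
}
```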

In fact, I have realized that P2P may fail completely depending on which device is current when the test kernel runs. So the problem is not just that P2P is slow; it seems to be a more fundamental issue. The same test code runs just fine on another machine with Ubuntu, two Titan Blacks, and CUDA 10.1, which verifies that the test code itself is correct. I posted this question on StackOverflow in the past and was advised that it is likely a system or platform issue. I have reinstalled the driver and CUDA several times, and I have tried CUDA 10.1 as well, but nothing changed this behavior. Can someone shed some light on how I can dig deeper and find the source of the problem?
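
To make the direction dependence concrete, here is a sketch of the kind of test I mean (again assuming device IDs 0 and 1 for the K40cs): the same copy kernel is launched once with each GPU current, each time reading the other GPU's buffer directly over P2P, and the result is verified on the host. On this machine, one direction can fail verification while the other passes:

```cpp
// Direction-dependence sketch: launch the same kernel on each GPU,
// reading the peer GPU's buffer directly. Assumes devices 0 and 1 are
// the K40cs; error checking mostly omitted for brevity.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void copyPeer(const float* src, float* dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];  // src lives on the peer GPU
}

static bool runOneDirection(int execDev, int peerDev) {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Source buffer on the peer device, filled with a known pattern.
    cudaSetDevice(peerDev);
    float* src;
    cudaMalloc(&src, bytes);
    float* h = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = (float)i;
    cudaMemcpy(src, h, bytes, cudaMemcpyHostToDevice);

    // The executing device needs peer access to read src directly.
    cudaSetDevice(execDev);
    cudaDeviceEnablePeerAccess(peerDev, 0);  // "already enabled" is harmless here
    float* dst;
    cudaMalloc(&dst, bytes);
    copyPeer<<<(n + 255) / 256, 256>>>(src, dst, n);
    cudaDeviceSynchronize();

    // Verify on the host.
    cudaMemcpy(h, dst, bytes, cudaMemcpyDeviceToHost);
    bool ok = true;
    for (int i = 0; i < n; ++i)
        if (h[i] != (float)i) { ok = false; break; }

    free(h);
    cudaFree(dst);
    cudaSetDevice(peerDev);
    cudaFree(src);
    return ok;
}

int main() {
    printf("kernel on GPU1 reading GPU0: %s\n", runOneDirection(1, 0) ? "PASS" : "FAIL");
    printf("kernel on GPU0 reading GPU1: %s\n", runOneDirection(0, 1) ? "PASS" : "FAIL");
    return 0;
}
```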

Thank you for your time!

Have you given up?

Never! I raised the issue with the HP support team to investigate whether there is a problem with my motherboard. They sent me a list of NVIDIA devices that are supported under the current BIOS version, and the K40 was not on it. They mentioned that this does not mean the Tesla K40 should not work with this motherboard and BIOS version; it just means HP has not tested the K40 against the latest BIOS. So I looked into the documentation and found that the K40 actually was officially supported with this motherboard on one of the older BIOS versions, so I downgraded my BIOS, and consequently downgraded my NVIDIA driver and CUDA too. I even had to downgrade my Visual Studio to support that version of CUDA. Anyway, after all these downgrades I thought I had a system that should support the two K40s. So I ran the P2P tests and booooom... it did not work again!!! At that very moment, I opened the case, pulled out the K40s, put them on a table, and have been shouting at them every morning since! ;)
