I am having difficulty establishing a P2P connection between two Tesla K40c devices on the same PCIe link, and I do not know what the problem could be. Here is some information about my machine:
OS: Windows 10 x64
CPU: 2 x Intel Xeon Gold 6136 @ 3.0 GHz
RAM: 392 GB
Motherboard: HP 81C7
BIOS version: v02.47 (up to date as of 7/May/2020)
NVIDIA driver version: 441.22
CUDA version: Release 10.2 (V10.2.89)
GPUs: 3 (see below)
Here is the deviceQuery result:
Device 0: "Tesla K40c"
CUDA Driver Version / Runtime Version 10.2 / 10.2
CUDA Device Driver Mode (TCC or WDDM): TCC
Device PCI Domain ID / Bus ID / location ID: 0 / 21 / 0
Device 1: "Tesla K40c"
CUDA Driver Version / Runtime Version 10.2 / 10.2
CUDA Device Driver Mode (TCC or WDDM): TCC
Device PCI Domain ID / Bus ID / location ID: 0 / 45 / 0
Device 2: "Quadro P400"
CUDA Driver Version / Runtime Version 10.2 / 10.2
CUDA Device Driver Mode (TCC or WDDM): WDDM
Device PCI Domain ID / Bus ID / location ID: 0 / 153 / 0
The Quadro is used for display purposes and is made invisible to my CUDA programs via set CUDA_VISIBLE_DEVICES=0,1. When I run the simpleP2P example, it runs successfully, but the results indicate there is a P2P issue: the memcpy speed is only about 0.2 GB/s, even though the Tesla devices are installed in two PCIe 3.0 x16 slots attached to CPU0:
Checking for multiple GPUs...
CUDA-capable device count: 2
Checking GPU(s) for support of peer to peer memory access...
> Peer access from Tesla K40c (GPU0) -> Tesla K40c (GPU1) : Yes
> Peer access from Tesla K40c (GPU1) -> Tesla K40c (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 0.19GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Shutting down...
Test passed
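For completeness, I also confirmed that with CUDA_VISIBLE_DEVICES set, only the two Teslas are enumerated. A quick sanity check along these lines (a sketch, not my exact code) prints 2 devices, both K40c:

```cuda
// Sanity check: with CUDA_VISIBLE_DEVICES=0,1 set, only the two
// Tesla K40c cards should be enumerated; the Quadro P400 is hidden.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    printf("Visible CUDA devices: %d\n", n);  // expect 2
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, i);
        printf("  %d: %s (PCI bus %d, device %d)\n",
               i, p.name, p.pciBusID, p.pciDeviceID);
    }
    return 0;
}
```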
In fact, I realized that P2P may fail completely depending on which device is active when the P2P test kernel is launched. So the problem is not just that P2P is slow; it seems to be a more fundamental issue. When I run the same test code on another machine (Ubuntu, two GTX Titan Blacks, CUDA 10.1), it works fine, which verifies that the test code itself is correct. I posted this question on Stack Overflow in the past and was advised that it is likely a system or platform issue. I have reinstalled the driver and CUDA several times, and I have also tried CUDA 10.1, but nothing changed this behavior. Can someone shed some light on how I can dig into this further and find the source of the problem?
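In case it is useful, the core of my test is essentially the following sketch (not my exact code; device IDs 0 and 1 are the two K40c cards after CUDA_VISIBLE_DEVICES is applied). It enables peer access both ways, copies a 64 MB buffer from GPU0 to GPU1 with cudaMemcpyPeer, and verifies the result on the host:

```cuda
// Minimal P2P repro: enable peer access, copy GPU0 -> GPU1, verify on host.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e = (call);                                       \
        if (e != cudaSuccess) {                                       \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e));                           \
            exit(1);                                                  \
        }                                                             \
    } while (0)

int main() {
    const size_t N = 64 << 20;  // 64 MB, as in simpleP2P

    int can01 = 0, can10 = 0;
    CHECK(cudaDeviceCanAccessPeer(&can01, 0, 1));
    CHECK(cudaDeviceCanAccessPeer(&can10, 1, 0));
    printf("P2P access 0->1: %d, 1->0: %d\n", can01, can10);

    // Enable peer access in both directions.
    CHECK(cudaSetDevice(0));
    CHECK(cudaDeviceEnablePeerAccess(1, 0));
    CHECK(cudaSetDevice(1));
    CHECK(cudaDeviceEnablePeerAccess(0, 0));

    unsigned char *d0, *d1;
    CHECK(cudaSetDevice(0));
    CHECK(cudaMalloc(&d0, N));
    CHECK(cudaMemset(d0, 0xAB, N));
    CHECK(cudaSetDevice(1));
    CHECK(cudaMalloc(&d1, N));

    // The peer copy whose bandwidth and correctness I am testing.
    CHECK(cudaMemcpyPeer(d1, 1, d0, 0, N));

    unsigned char *h = (unsigned char *)malloc(N);
    CHECK(cudaMemcpy(h, d1, N, cudaMemcpyDeviceToHost));
    for (size_t i = 0; i < N; ++i) {
        if (h[i] != 0xAB) {
            printf("Mismatch at byte %zu\n", i);
            return 1;
        }
    }
    printf("P2P copy verified\n");
    free(h);
    return 0;
}
```

On the failing machine, whether this succeeds seems to depend on which device is current when the copy is issued, which is what makes me suspect something below the application level.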
Thank you for your time!