Peer-to-peer transfer failing on GeForce GTX Titan Z

jvincent · December 17, 2014, 12:48pm

I am attempting to run the simpleP2P example on a GeForce GTX Titan Z dual-GPU card. To summarize, both GPUs support peer-to-peer and UVA. What I am finding is that the example takes a very long time to run and fails the final result verification (simpleP2P returns an array of NaNs rather than the expected values). I have included the output below:

[./simpleP2P] - Starting…
Checking for multiple GPUs…
CUDA-capable device count: 2

GPU0 = “GeForce GTX TITAN Z” IS capable of Peer-to-Peer (P2P)
GPU1 = “GeForce GTX TITAN Z” IS capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access…

Peer-to-Peer (P2P) access from GeForce GTX TITAN Z (GPU0) → GeForce GTX TITAN Z (GPU1) : Yes
Peer-to-Peer (P2P) access from GeForce GTX TITAN Z (GPU1) → GeForce GTX TITAN Z (GPU0) : Yes
Enabling peer access between GPU0 and GPU1…
Checking GPU0 and GPU1 for UVA capabilities…
GeForce GTX TITAN Z (GPU0) supports UVA: Yes
GeForce GTX TITAN Z (GPU1) supports UVA: Yes
Both GPUs can support UVA, enabling…
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)…
Creating event handles…
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 1.05GB/s
Preparing host buffer and memcpy to GPU0…
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1…
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0…
Copy data back to host from GPU0 and verify results…
Verification error @ element 0: val = nan, ref = 0.000000
Verification error @ element 1: val = nan, ref = 4.000000
Verification error @ element 2: val = nan, ref = 8.000000
Verification error @ element 3: val = nan, ref = 12.000000
Verification error @ element 4: val = nan, ref = 16.000000
Verification error @ element 5: val = nan, ref = 20.000000
Verification error @ element 6: val = nan, ref = 24.000000
Verification error @ element 7: val = nan, ref = 28.000000
Verification error @ element 8: val = nan, ref = 32.000000
Verification error @ element 9: val = nan, ref = 36.000000
Verification error @ element 10: val = nan, ref = 40.000000
Verification error @ element 11: val = nan, ref = 44.000000
Enabling peer access…
Shutting down…
Test failed!

Robert_Crovella · December 17, 2014, 3:42pm

Which linux distro are you using?

jvincent · December 17, 2014, 4:53pm

Linux version 3.13.0-32-generic (buildd@kissel) (gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) ) #57-Ubuntu SMP Tue Jul 15 03:51:08 UTC 2014

njuffa · December 17, 2014, 4:58pm

This seems to indicate Ubuntu 14.04. To be sure, could you show the output of the following command:

lsb_release -a

I assume this is a 64-bit platform? The output from uname -a you show above seems to be truncated; as I recall, there should be an architecture specification after the UTC date/time stamp, though I am not entirely sure.

jvincent · December 17, 2014, 5:08pm

Here is the output of lsb_release -a:

No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.1 LTS
Release: 14.04
Codename: trusty

and here is the output of uname -a:

Linux funl-guava 3.13.0-32-generic #57-Ubuntu SMP Tue Jul 15 03:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

njuffa · December 17, 2014, 5:10pm

According to the Linux Getting Started Guide (http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/), 64-bit Ubuntu 14.04 is supported by CUDA 6.5. Are you running CUDA 6.5 (final release)?

If so, I have no idea why this sample app does not work for you. I will point out that I have never used a Titan Z, so I would not know if there are any gotchas to be aware of.

jvincent · December 17, 2014, 5:32pm

I’m running release 6.5, V6.5.12

Thanks for the help

njuffa · December 17, 2014, 9:34pm

You may want to consider filing a bug report with NVIDIA. The bug reporting form is linked from the registered developer website, https://developer.nvidia.com/

Nikita-14 · February 20, 2015, 7:02pm

GPU0 = “GeForce GTX TITAN Z” IS capable of Peer-to-Peer (P2P)

Can somebody enlighten me why MY Titan Z is NOT capable of Peer-to-Peer when running the same test?

XNTMAX · April 5, 2015, 4:47pm

Has anyone found a solution to this problem?

I have two K40m GPUs in a chassis with a PLX switch.
I have already tried everything, including moving them to different hardware, using a different OS, changing various BIOS settings, changing the driver mode, and changing the ECC mode.
NO LUCK whatsoever.

sBc-Random · April 6, 2015, 1:33pm

I posted about a similar thing a few months back. Though, I’m not sure if it’s EXACTLY the same thing, as I wasn’t getting a NaN; rather, no copy was occurring at all. From memory, once the transfer size reached a certain threshold (say, once you try to transfer more than 512 bytes), everything seemed to work fine.

I was running Ubuntu, and found the error in 5.5, 6.0 and POSSIBLY 6.5

This is all from memory so I might make a mistake or two here…
https://devtalk.nvidia.com/default/topic/795526/issues-with-multithreaded-peertopeer-copies/#4390656

sBc-Random · April 6, 2015, 1:38pm

Followup:

Try using cublasScopy and see if the issue resolves. If it does, I believe we have the same issue…

sBc-Random · April 6, 2015, 1:55pm

Removed. sorry

njuffa · April 6, 2015, 3:03pm

I can only re-iterate my recommendation from #8: If these issues are reproducible with the latest release drivers, I would suggest filing bugs with NVIDIA, using the reporting form linked from the registered developer website.

XNTMAX · April 6, 2015, 4:14pm

Let me clarify. I am just running the sample apps from NVIDIA.

This occurs on CUDA 6.5 and 7.0.
This occurs on CentOS 6.6 and Ubuntu 14.04.
This occurs on two identical machines.
The GPUs are verified to be healthy and OK. (I ran fielddiag on them.)

Here is what happens when I run simpleP2P:
[./simpleP2P] - Starting…
Checking for multiple GPUs…
CUDA-capable device count: 2

GPU0 = " Tesla K40m" IS capable of Peer-to-Peer (P2P)
GPU1 = " Tesla K40m" IS capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access…

Peer-to-Peer (P2P) access from Tesla K40m (GPU0) → Tesla K40m (GPU1) : Yes
Peer-to-Peer (P2P) access from Tesla K40m (GPU1) → Tesla K40m (GPU0) : Yes
Enabling peer access between GPU0 and GPU1…
Checking GPU0 and GPU1 for UVA capabilities…
Tesla K40m (GPU0) supports UVA: Yes
Tesla K40m (GPU1) supports UVA: Yes
Both GPUs can support UVA, enabling…
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)…
Creating event handles…
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 1.12GB/s
Preparing host buffer and memcpy to GPU0…
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1…
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0…
Copy data back to host from GPU0 and verify results…
Verification error @ element 0: val = nan, ref = 0.000000
Verification error @ element 1: val = nan, ref = 4.000000
Verification error @ element 2: val = nan, ref = 8.000000
Verification error @ element 3: val = nan, ref = 12.000000
Verification error @ element 4: val = nan, ref = 16.000000
Verification error @ element 5: val = nan, ref = 20.000000
Verification error @ element 6: val = nan, ref = 24.000000
Verification error @ element 7: val = nan, ref = 28.000000
Verification error @ element 8: val = nan, ref = 32.000000
Verification error @ element 9: val = nan, ref = 36.000000
Verification error @ element 10: val = nan, ref = 40.000000
Verification error @ element 11: val = nan, ref = 44.000000
Enabling peer access…
Shutting down…
Test failed!

Also, if I run nvidia-smi, one of the GPUs shows 97% utilization. This occurs even though neither of the K40s is set as the display GPU. If I switch the driver to persistence mode, the utilization goes away. (This could be absolutely unrelated.)

Robert_Crovella · April 6, 2015, 6:27pm

What sort of system are these 2 K40m GPUs plugged into?

XNTMAX · April 7, 2015, 12:53am

It is SMC 2028. These two GPUs are sitting on a single PCIe switch.

After going back and forth with numerous parties, I found a solution to this issue. It was a goose chase for a while. The issue is caused by the configuration of the PCIe switch.
This misconfiguration affects P2P communication. The solution is to configure the PCIe switch to properly allow peer-to-peer communication; otherwise, the GPUs work fine individually but fail when communicating with each other.
So if you are facing the same issue, pick up the phone and call your motherboard manufacturer, or look for the latest BIOS update. If you are using PCIe extension boards, make sure that the latest firmware is applied to the PCIe switch chipset.

Robert_Crovella · April 21, 2015, 7:25pm

For all reports of P2P transfer issues in this thread, try updating your motherboard to the latest BIOS version.
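
For anyone checking whether a BIOS or PCIe switch firmware update actually changed anything, a standalone copy test can be easier to read than the full simpleP2P output. The following is only a rough sketch, not part of the CUDA samples: it assumes the two GPUs of interest are devices 0 and 1, the file name p2p_check.cu and the CHECK macro are made up for illustration, and the size sweep (64 bytes to 64 MiB) is an arbitrary choice meant to cover the small-transfer threshold mentioned earlier in the thread. It uses an integer pattern instead of floats so that corrupted data shows up as a plain mismatch.

```cuda
// p2p_check.cu: hypothetical diagnostic, not part of the CUDA samples.
// Assumed build command: nvcc -o p2p_check p2p_check.cu
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "%s failed at line %d: %s\n", #call,       \
                    __LINE__, cudaGetErrorString(err_));               \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

int main() {
    int count = 0;
    CHECK(cudaGetDeviceCount(&count));
    if (count < 2) {
        printf("Need at least two GPUs, found %d\n", count);
        return EXIT_SUCCESS;
    }

    // Ask the driver whether it believes P2P is possible in each direction.
    int can01 = 0, can10 = 0;
    CHECK(cudaDeviceCanAccessPeer(&can01, 0, 1));
    CHECK(cudaDeviceCanAccessPeer(&can10, 1, 0));
    printf("P2P 0->1: %s, 1->0: %s\n", can01 ? "yes" : "no", can10 ? "yes" : "no");
    if (!can01 || !can10) return EXIT_SUCCESS;

    // Enable peer access in both directions (assumes devices 0 and 1).
    CHECK(cudaSetDevice(0));
    CHECK(cudaDeviceEnablePeerAccess(1, 0));
    CHECK(cudaSetDevice(1));
    CHECK(cudaDeviceEnablePeerAccess(0, 0));

    int failures = 0;
    // Sweep transfer sizes from 64 bytes to 64 MiB, multiplying by 4 each step.
    for (size_t bytes = 64; bytes <= (64u << 20); bytes *= 4) {
        size_t n = bytes / sizeof(unsigned int);
        std::vector<unsigned int> src(n), dst(n, 0u);
        for (size_t i = 0; i < n; ++i)
            src[i] = 0xA5A50000u + (unsigned int)(i & 0xFFFF);

        unsigned int *d0 = nullptr, *d1 = nullptr;
        CHECK(cudaSetDevice(0));
        CHECK(cudaMalloc(&d0, bytes));
        CHECK(cudaMemcpy(d0, src.data(), bytes, cudaMemcpyHostToDevice));
        CHECK(cudaSetDevice(1));
        CHECK(cudaMalloc(&d1, bytes));
        CHECK(cudaMemset(d1, 0, bytes));

        // GPU0 -> GPU1 copy, then read the result back from GPU1 to the host.
        CHECK(cudaMemcpyPeer(d1, 1, d0, 0, bytes));
        CHECK(cudaDeviceSynchronize());
        CHECK(cudaMemcpy(dst.data(), d1, bytes, cudaMemcpyDeviceToHost));

        size_t bad = 0;
        for (size_t i = 0; i < n; ++i)
            if (dst[i] != src[i]) ++bad;
        printf("%10zu bytes: %s (%zu mismatched words)\n",
               bytes, bad ? "FAIL" : "ok", bad);
        if (bad) ++failures;

        CHECK(cudaFree(d1));
        CHECK(cudaSetDevice(0));
        CHECK(cudaFree(d0));
    }
    return failures ? EXIT_FAILURE : EXIT_SUCCESS;
}
```

If only the smallest sizes fail, that would be consistent with the threshold behaviour sBc-Random described; if every size fails while each GPU works on its own, the PCIe switch configuration or BIOS discussed above seems the more likely place to look. Neither outcome proves the cause by itself.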
