Peer-to-peer transfer failing on GeForce GTX Titan Z

I am attempting to run the simpleP2P example on a GeForce GTX Titan Z dual-GPU card. Both GPUs report support for peer-to-peer and UVA, but the example takes a very long time to run and fails the final result verification (simpleP2P returns an array of NaNs rather than the correct result). I have included the output below:

[./simpleP2P] - Starting…
Checking for multiple GPUs…
CUDA-capable device count: 2

GPU0 = “GeForce GTX TITAN Z” IS capable of Peer-to-Peer (P2P)
GPU1 = “GeForce GTX TITAN Z” IS capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access…

Peer-to-Peer (P2P) access from GeForce GTX TITAN Z (GPU0) -> GeForce GTX TITAN Z (GPU1) : Yes
Peer-to-Peer (P2P) access from GeForce GTX TITAN Z (GPU1) -> GeForce GTX TITAN Z (GPU0) : Yes
Enabling peer access between GPU0 and GPU1…
Checking GPU0 and GPU1 for UVA capabilities…
GeForce GTX TITAN Z (GPU0) supports UVA: Yes
GeForce GTX TITAN Z (GPU1) supports UVA: Yes
Both GPUs can support UVA, enabling…
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)…
Creating event handles…
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 1.05GB/s
Preparing host buffer and memcpy to GPU0…
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1…
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0…
Copy data back to host from GPU0 and verify results…
Verification error @ element 0: val = nan, ref = 0.000000
Verification error @ element 1: val = nan, ref = 4.000000
Verification error @ element 2: val = nan, ref = 8.000000
Verification error @ element 3: val = nan, ref = 12.000000
Verification error @ element 4: val = nan, ref = 16.000000
Verification error @ element 5: val = nan, ref = 20.000000
Verification error @ element 6: val = nan, ref = 24.000000
Verification error @ element 7: val = nan, ref = 28.000000
Verification error @ element 8: val = nan, ref = 32.000000
Verification error @ element 9: val = nan, ref = 36.000000
Verification error @ element 10: val = nan, ref = 40.000000
Verification error @ element 11: val = nan, ref = 44.000000
Enabling peer access…
Shutting down…
Test failed!

Which linux distro are you using?

Linux version 3.13.0-32-generic (buildd@kissel) (gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) ) #57-Ubuntu SMP Tue Jul 15 03:51:08 UTC 2014

This seems to indicate Ubuntu 14.04. To be sure, could you show the output of the following command:

lsb_release -a

I assume this is a 64-bit platform? The output from uname -a you show above seems to be truncated: as I recall, there should be an architecture specification after the UTC date/time stamp, but I am not entirely sure.
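For anyone gathering the same details, here is a minimal sketch that prints the kernel release, machine architecture, and distribution in one go. The function name report_platform is made up for illustration, and the sketch assumes a POSIX shell; lsb_release is not installed on every distro, so it falls back to a placeholder line.

```shell
#!/bin/sh
# Print the platform details asked for in this thread.
# report_platform is a hypothetical helper name, not part of any NVIDIA tool.
report_platform() {
    echo "kernel: $(uname -r)"
    # uname -m prints the machine hardware name, e.g. x86_64 on 64-bit x86.
    echo "arch: $(uname -m)"
    if command -v lsb_release >/dev/null 2>&1; then
        # -d selects the description line, -s drops the "Description:" label.
        echo "distro: $(lsb_release -ds 2>/dev/null)"
    else
        echo "distro: unknown (lsb_release not installed)"
    fi
}

report_platform
```

Pasting these three lines into a post answers the distro and 32-bit/64-bit questions at once.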

Here is the output of lsb_release -a:

No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.1 LTS
Release: 14.04
Codename: trusty

and here is the output of uname -a:

Linux funl-guava 3.13.0-32-generic #57-Ubuntu SMP Tue Jul 15 03:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

According to the Linux Getting Started Guide http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/, 64-bit Ubuntu 14.04 is supported by CUDA 6.5. Are you running CUDA 6.5 (final release)?
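If it helps, the toolkit and driver versions can be read off with something like the sketch below. It assumes nvcc is on the PATH and that the NVIDIA kernel module is loaded; report_cuda_versions is a hypothetical helper name, and each check falls back to a message when its assumption does not hold.

```shell
#!/bin/sh
# Report the CUDA toolkit and driver versions, if they can be found.
# report_cuda_versions is a hypothetical helper name.
report_cuda_versions() {
    if command -v nvcc >/dev/null 2>&1; then
        # The line containing "release" carries the "release X.Y, VX.Y.Z" string.
        echo "toolkit: $(nvcc --version | grep -i 'release')"
    else
        echo "toolkit: nvcc not found on PATH"
    fi
    if [ -r /proc/driver/nvidia/version ]; then
        # The first line names the loaded NVIDIA kernel module and its version.
        echo "driver: $(head -n 1 /proc/driver/nvidia/version)"
    else
        echo "driver: NVIDIA kernel module not loaded"
    fi
}

report_cuda_versions
```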

If so, I have no idea why this sample app does not work for you. I will point out that I have never used a Titan Z, so would not know if there are any gotchas to be aware of.

I’m running release 6.5, V6.5.12

Thanks for the help

You may want to consider filing a bug report with NVIDIA. The bug reporting form is linked from the registered developer website, https://developer.nvidia.com/

GPU0 = “GeForce GTX TITAN Z” IS capable of Peer-to-Peer (P2P)

Can somebody enlighten me as to why MY Titan Z is NOT capable of Peer-to-Peer when running the same test?

Has anyone found a solution to this problem?

I have two K40m GPUs in a chassis with a PLX switch.
I have already tried everything, including moving them to different hardware, using a different OS, and changing various BIOS settings, the driver mode, and the ECC mode.
NO LUCK whatsoever.

I posted about a similar thing a few months back, though I’m not sure it’s EXACTLY the same thing: I wasn’t getting NaNs, but rather no copy was occurring at all. From memory, once the transfer size reached a certain threshold (say, once you try to transfer more than 512 bytes), everything seemed to work fine.

I was running Ubuntu, and found the error in CUDA 5.5, 6.0 and POSSIBLY 6.5.

This is all from memory so I might make a mistake or two here…
https://devtalk.nvidia.com/default/topic/795526/issues-with-multithreaded-peertopeer-copies/#4390656

Followup:

Try using cublasScopy and see if the issue resolves. If it does, I believe we have the same issue…

I can only reiterate my recommendation from #8: If these issues are reproducible with the latest release drivers, I would suggest filing bugs with NVIDIA, using the reporting form linked from the registered developer website.

Let me clarify. I am just running the sample apps from NVIDIA.

  • This occurs on CUDA 6.5 and 7.0.
  • This occurs on CentOS 6.6 and Ubuntu 14.04.
  • This occurs on two identical machines.
  • The GPUs are verified to be healthy and OK (I ran fielddiag on them).

Here is what happens when I run simpleP2P:
[./simpleP2P] - Starting…
Checking for multiple GPUs…
CUDA-capable device count: 2

GPU0 = " Tesla K40m" IS capable of Peer-to-Peer (P2P)
GPU1 = " Tesla K40m" IS capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access…

Peer-to-Peer (P2P) access from Tesla K40m (GPU0) -> Tesla K40m (GPU1) : Yes
Peer-to-Peer (P2P) access from Tesla K40m (GPU1) -> Tesla K40m (GPU0) : Yes
Enabling peer access between GPU0 and GPU1…
Checking GPU0 and GPU1 for UVA capabilities…
Tesla K40m (GPU0) supports UVA: Yes
Tesla K40m (GPU1) supports UVA: Yes
Both GPUs can support UVA, enabling…
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)…
Creating event handles…
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 1.12GB/s
Preparing host buffer and memcpy to GPU0…
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1…
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0…
Copy data back to host from GPU0 and verify results…
Verification error @ element 0: val = nan, ref = 0.000000
Verification error @ element 1: val = nan, ref = 4.000000
Verification error @ element 2: val = nan, ref = 8.000000
Verification error @ element 3: val = nan, ref = 12.000000
Verification error @ element 4: val = nan, ref = 16.000000
Verification error @ element 5: val = nan, ref = 20.000000
Verification error @ element 6: val = nan, ref = 24.000000
Verification error @ element 7: val = nan, ref = 28.000000
Verification error @ element 8: val = nan, ref = 32.000000
Verification error @ element 9: val = nan, ref = 36.000000
Verification error @ element 10: val = nan, ref = 40.000000
Verification error @ element 11: val = nan, ref = 44.000000
Enabling peer access…
Shutting down…
Test failed!

Also, if I run nvidia-smi, one of the GPUs shows 97% utilization, even though neither K40 is set as the display GPU. If I switch the driver to persistence mode, the utilization goes away. (This could be completely unrelated.)
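For reference, a rough sketch of how the utilization and persistence mode can be read per GPU. It assumes nvidia-smi is on the PATH and supports the --query-gpu fields used here; show_gpu_state is a hypothetical helper name. Whether the utilization reading has anything to do with the P2P failure is not established in this thread.

```shell
#!/bin/sh
# Print utilization and persistence mode for each GPU as CSV.
# show_gpu_state is a hypothetical helper name.
show_gpu_state() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=index,name,utilization.gpu,persistence_mode --format=csv \
            || echo "nvidia-smi failed (driver not loaded?)"
    else
        echo "nvidia-smi not found on PATH"
    fi
}

show_gpu_state

# To turn persistence mode on for all GPUs (needs root):
#   sudo nvidia-smi -pm 1
```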

What sort of system are these 2 K40m GPUs plugged into?

It is SMC 2028. These two GPUs are sitting on a single PCIe switch.

I found a solution to this issue after going back and forth with numerous parties. It was a wild goose chase for a while. The issue is caused by the configuration of the PCIe switch, which affects P2P communication.
The solution is to configure the PCIe switch to properly allow peer-to-peer communication. Otherwise, the GPUs work fine individually but fail when communicating with each other.
So if you are facing the same issue, pick up the phone and call your motherboard manufacturer, or look for the latest BIOS update. If you are using PCIe extension boards, make sure that the latest firmware is applied to the PCIe switch chipset.
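Before making that call, it may help to capture how the GPUs sit behind the switch. The sketch below is only a starting point: show_pcie_topology is a hypothetical helper name, nvidia-smi topo -m is only available on drivers that support it, and the ACSCtl check is an assumption on my part. Access Control Services is one switch setting that can affect peer-to-peer traffic, but this thread does not say which setting was actually wrong.

```shell
#!/bin/sh
# Dump what the OS can see of the PCIe topology around the GPUs.
# show_pcie_topology is a hypothetical helper name.
show_pcie_topology() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        # Matrix describing how each pair of GPUs is connected.
        nvidia-smi topo -m || echo "nvidia-smi topo failed or is unsupported"
    else
        echo "nvidia-smi not found on PATH"
    fi
    if command -v lspci >/dev/null 2>&1; then
        # Tree view of the PCIe hierarchy, to locate the switch above the GPUs.
        lspci -tv || echo "lspci -tv failed"
        # ASSUMPTION: ACS is only a candidate; the thread does not name the setting.
        # Reading capabilities usually needs root; without it nothing may match.
        lspci -vvv 2>/dev/null | grep -i 'ACSCtl' || echo "no ACSCtl lines visible"
    else
        echo "lspci not found on PATH"
    fi
}

show_pcie_topology
```

Attaching this output to a support request gives the manufacturer the switch model and port layout without a second round of questions.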

For all reports of P2P transfer issues in this thread, try updating your motherboard to the latest BIOS version.