multi-GPU Peer to Peer access CUDA SDK example not working, why?

Hello,
I have two Tesla K40c GPUs installed in a workstation running Arch Linux 64-bit. Both GPUs have worked correctly for every GPU program I have launched in the past, all single GPU.
Now I am writing a multi-GPU program in which, at one point, I copy an array from GPU0 to GPU1 directly using P2P (cudaMemcpyPeer(…)). The cudaMemcpyPeer call does not report any error, but when that memory region is accessed by the GPU, the program gets stuck as if it were waiting for something. If I comment out the P2P part and just fill that region of the second GPU's memory with arbitrary numbers, the program runs as expected (giving a wrong result, obviously). Just to clarify, multi-GPU programs that do not use peer-to-peer memory transfers work perfectly, so multi-GPU itself is not the problem.
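For reference, the transfer in my program follows roughly this pattern (a simplified sketch; the function and buffer names are illustrative, not my actual code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Simplified sketch of the GPU0 -> GPU1 copy that hangs.
// Peer access must be enabled once per direction before the copy.
void copy_gpu0_to_gpu1(void *d_dst, const void *d_src, size_t count)
{
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);  // let GPU0 map GPU1's memory
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);  // and the reverse direction

    // Direct device-to-device copy: d_src on GPU0, d_dst on GPU1.
    cudaError_t err = cudaMemcpyPeer(d_dst, 1, d_src, 0, count);
    if (err != cudaSuccess)
        fprintf(stderr, "cudaMemcpyPeer: %s\n", cudaGetErrorString(err));
}
```

As I said, the call itself returns cudaSuccess; the hang only shows up later, when a kernel reads the copied region.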

At first I thought it was a program-specific mistake I had to solve myself, but when I ran the CUDA simpleP2P test, I was surprised to find that the example gets stuck too. This is the output of the simpleP2P CUDA example:

[user]./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
> GPU0 = "     Tesla K40c" IS  capable of Peer-to-Peer (P2P)
> GPU1 = "     Tesla K40c" IS  capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access...
> Peer-to-Peer (P2P) access from Tesla K40c (GPU0) -> Tesla K40c (GPU1) : Yes
> Peer-to-Peer (P2P) access from Tesla K40c (GPU1) -> Tesla K40c (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Checking GPU0 and GPU1 for UVA capabilities...
> Tesla K40c (GPU0) supports UVA: Yes
> Tesla K40c (GPU1) supports UVA: Yes
Both GPUs can support UVA, enabling...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...

The program freezes at ‘Creating event handles…’. If I run nvidia-smi at that moment, from another ssh session, I get:

[user]$ nvidia-smi
Wed Dec 10 13:33:43 2014       
+------------------------------------------------------+                       
| NVIDIA-SMI 340.32     Driver Version: 340.32         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40c          Off  | 0000:01:00.0     Off |                  Off |
| 28%   58C    P0    62W / 235W |    162MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40c          Off  | 0000:04:00.0     Off |                  Off |
| 29%   60C    P0    64W / 235W |    162MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0     11181  ./simpleP2P                                          136MiB |
|    1     11181  ./simpleP2P                                          136MiB |
+-----------------------------------------------------------------------------+

We can see that both GPUs have the program running, but neither of them is doing any work.
Finally, this is the last part of the strace output, where it hangs (requested by a member of SO):

15:46:22.352233 write(1, "Checking GPU0 and GPU1 for UVA c"..., 47Checking GPU0 and GPU1 for UVA capabilities...
) = 47 <0.000025>
15:46:22.352295 write(1, "> Tesla K40c (GPU0) supports UVA"..., 38> Tesla K40c (GPU0) supports UVA: Yes
) = 38 <0.000021>
15:46:22.352351 write(1, "> Tesla K40c (GPU1) supports UVA"..., 38> Tesla K40c (GPU1) supports UVA: Yes
) = 38 <0.000022>
15:46:22.352408 write(1, "Both GPUs can support UVA, enabl"..., 39Both GPUs can support UVA, enabling...
) = 39 <0.000021>
15:46:22.352468 write(1, "Allocating buffers (64MB on GPU0"..., 56Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
) = 56 <0.000023>
15:46:22.352548 ioctl(3, 0xc0a0464a, 0x7fff5e713cd0) = 0 <0.000075>
15:46:22.352658 ioctl(3, 0xc0384657, 0x7fff5e713c20) = 0 <0.000109>
15:46:22.352799 ioctl(3, 0xc020462a, 0x7fff5e713c50) = 0 <0.000036>
15:46:22.352895 ioctl(3, 0xc0384657, 0x7fff5e713ab0) = 0 <0.000115>
15:46:22.353059 ioctl(3, 0xc0a0464a, 0x7fff5e713cd0) = 0 <0.000070>
15:46:22.353163 ioctl(3, 0xc0384657, 0x7fff5e713c20) = 0 <0.000114>
15:46:22.353309 ioctl(3, 0xc020462a, 0x7fff5e713c50) = 0 <0.000035>
15:46:22.353389 ioctl(3, 0xc0384657, 0x7fff5e713ab0) = 0 <0.000139>
15:46:22.353577 mmap(0x208a00000, 67108864, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x208a00000 <0.000030>
15:46:22.353641 ioctl(8, 0xc0304627, 0x7fff5e713b20) = 0 <0.032102>
15:46:22.385842 ioctl(3, 0xc0384657, 0x7fff5e713bb0) = 0 <0.001215>
15:46:22.387187 ioctl(3, 0xc01c4634, 0x7fff5e7139f0) = 0 <0.000057>
15:46:22.387319 ioctl(3, 0xc0384657, 0x7fff5e713a40) = 0 <0.001193>
15:46:22.388604 write(1, "Creating event handles...\n", 26Creating event handles...
) = 26 <0.000033>
15:46:22.388949 futex(0x2456f68, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable) <0.000017>
15:46:22.389019 futex(0x2456f68, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000012>
15:46:22.389684 write(26, "53", 1)    = 1 <0.000035>
15:46:22.389804 futex(0x7fff5e714040, FUTEX_WAIT_PRIVATE, 0, NULL

→ the strace output stops right at that last line: no more execution, no more messages.

Any ideas? Can someone try running the ./simpleP2P example on Arch Linux with multiple GPUs and see if it works?

The corresponding SO posting:

[url]http://stackoverflow.com/questions/27406327/cuda-p2p-freezes-the-application-why[/url]

which has a few additional tidbits such as details of the HW platform.

Hi txbob,

I was actually the one asking there! But the answer may require more CUDA-savvy people, which is why I switched to this forum.

This might not be helpful, but I remember reading in the programming guide that with unified virtual addressing (UVA), the P2P copy API is not strictly needed, and a plain cudaMemcpy could/should be used instead (with cudaMemcpyDefault). Maybe you could try replacing all the cudaMemcpyPeer calls just to see what happens.
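In case it helps, the replacement would look something like this (pointer names are illustrative; d_src is assumed to reside on GPU0 and d_dst on GPU1):

```cuda
#include <cuda_runtime.h>

// With UVA, the runtime infers each pointer's device from its address,
// so a plain cudaMemcpy can stand in for cudaMemcpyPeer.
cudaError_t copy_via_uva(void *d_dst, const void *d_src, size_t count)
{
    return cudaMemcpy(d_dst, d_src, count, cudaMemcpyDefault);
    // previously: cudaMemcpyPeer(d_dst, 1, d_src, 0, count);
}
```

If this version hangs too, at least you would know the problem is below the copy API, not in how the copy is issued.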

FWIW, simpleP2P runs successfully on 2 GTX980s (Ubuntu 14.04, CUDA 6.5)

I will try this today and reply back, thanks

Did you solve this? I have the exact same problem with my GTX 780 6GB GPUs on Ubuntu 12.04 with CUDA 6.5. Any help would be appreciated!

what is the sample output - the exact ‘error message’? in many cases this is helpful

also, note the output of lspci -vb

@little_jimmy: in my case (and in neoideo’s too, I think), there is no error message - the program just doesn’t do anything (and it does not crash); it seems to be waiting for the p2p copy to finish, which it never does.

As you can see above, p2p access IS available, bus mastering is enabled (at least on my machine), and everything should be in order… I am using a Gigabyte 990FXA-UD7 motherboard, and although it has all the right specs, I am starting to think it might be the culprit here.

Any help would be greatly appreciated :-)


it seems that you get past the p2p support tests then, up to the point of the actual transfers - something that was not explicitly clear from your previous post

“I am using a Gigabyte 990FXA-UD7 motherboard, and although it has all the right specs, I am thinking that that might be the culprit here?”

if the p2p support tests pass, i would pay considerably less attention to hardware, and start diverting attention to software

perhaps just check that your OS and its version are cuda-supported, as well as auxiliary packages like gcc; i am confident that the p2p sample works for all supported and listed OSes - the catch being ‘supported and listed’
you mention ubuntu 12; others have mentioned that it works on ubuntu 14

if that proves unfruitful, you may very well step through the p2p sample in the debugger and note the specific call after the p2p support tests that fails to complete; this too would be insightful
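Rather than stepping the whole sample, a minimal standalone repro could also narrow it down - something along these lines (a sketch assuming two P2P-capable devices; stepping over each call in cuda-gdb shows exactly which one never returns):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Minimal two-GPU P2P check: one cudaMemcpyPeer from device 0 to device 1.
#define CHECK(call) do { cudaError_t e = (call); \
    if (e != cudaSuccess) { fprintf(stderr, "%s: %s\n", #call, \
        cudaGetErrorString(e)); return 1; } } while (0)

int main(void)
{
    int n = 0;
    CHECK(cudaGetDeviceCount(&n));
    if (n < 2) { fprintf(stderr, "need two GPUs\n"); return 1; }

    // Enable peer access in both directions.
    CHECK(cudaSetDevice(0));
    CHECK(cudaDeviceEnablePeerAccess(1, 0));
    CHECK(cudaSetDevice(1));
    CHECK(cudaDeviceEnablePeerAccess(0, 0));

    // One small buffer per device.
    size_t bytes = 1 << 20;
    void *d0 = NULL, *d1 = NULL;
    CHECK(cudaSetDevice(0));
    CHECK(cudaMalloc(&d0, bytes));
    CHECK(cudaSetDevice(1));
    CHECK(cudaMalloc(&d1, bytes));

    printf("starting peer copy...\n");
    CHECK(cudaMemcpyPeer(d1, 1, d0, 0, bytes));   // suspected hang point
    CHECK(cudaDeviceSynchronize());
    printf("peer copy completed\n");
    return 0;
}
```

If this hangs at the cudaMemcpyPeer (or the synchronize), that would match the simpleP2P behaviour and point away from anything application-specific.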


Hi anderso,
I could never solve the problem on that computer running Arch Linux 64-bit. My motherboard was a Gigabyte 990DXA-UD3, almost the same as yours. Could it be that the motherboard has problems? Maybe a BIOS update could solve it, or the motherboard simply has a defect. Failing that, I would blame Arch Linux, but I think that is improbable, since those calls are kernel calls and should behave the same as on the officially supported Linux distributions.

In the end, I moved the GPUs to a rack-mounted server, and the P2P test works perfectly there.

Hi neoideo,

Thanks for responding. Yes, it sounds very much like a problem with the motherboard, as I suspected. We should both contact Gigabyte’s customer service and see if they have a fix!

Not the BIOS:
I tried updating to the latest BIOS (a beta version), and that did not help either.

Not a software problem:
I physically moved the HD from a similar Linux box (with an MSI 890FXA-GD70 motherboard, where p2p copying works) to my machine, booted from that Ubuntu installation, and got the EXACT same error.