Hello,
I have two Tesla K40c GPUs installed in a workstation running 64-bit Arch Linux. Both GPUs have worked correctly for every GPU program I have run in the past, all of them single-GPU.
Now I am writing a multi-GPU program in which, at one point, I copy an array directly from GPU0 to GPU1 using P2P (cudaMemcpyPeer(…)). The cudaMemcpyPeer call does not report any error, but when that region of memory is later accessed by the GPU, the program gets stuck as if it were waiting for something. If I comment out the P2P part and simply fill that region on the second GPU with arbitrary numbers, the program runs as expected (giving a wrong result, obviously). To be clear, multi-GPU programs that do not use peer-to-peer memory transfers work perfectly, so multi-GPU itself is not the problem.
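For reference, the relevant part of my code looks roughly like this (a simplified, self-contained sketch; d0, d1 and nbytes are placeholder names rather than my real variables, and error checking is trimmed):

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t nbytes = 64 * 1024 * 1024;

    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);     // both report 1 on this machine
    printf("P2P 0->1: %d, 1->0: %d\n", can01, can10);

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);          // let GPU0 access GPU1's memory
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);          // let GPU1 access GPU0's memory

    float *d0 = NULL, *d1 = NULL;
    cudaSetDevice(0);
    cudaMalloc(&d0, nbytes);
    cudaSetDevice(1);
    cudaMalloc(&d1, nbytes);

    // Copy GPU0 -> GPU1; this call itself returns cudaSuccess...
    cudaMemcpyPeer(d1, 1, d0, 0, nbytes);

    // ...but in my real program a kernel on GPU1 that reads this buffer
    // afterwards never makes progress.
    cudaDeviceSynchronize();

    cudaFree(d1);
    cudaSetDevice(0);
    cudaFree(d0);
    return 0;
}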
At first I thought it was a program-specific mistake I had to solve myself, but when I ran the CUDA simpleP2P sample I was surprised to find that it gets stuck too. This is the output of the simpleP2P sample:
[user]$ ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
> GPU0 = " Tesla K40c" IS capable of Peer-to-Peer (P2P)
> GPU1 = " Tesla K40c" IS capable of Peer-to-Peer (P2P)
Checking GPU(s) for support of peer to peer memory access...
> Peer-to-Peer (P2P) access from Tesla K40c (GPU0) -> Tesla K40c (GPU1) : Yes
> Peer-to-Peer (P2P) access from Tesla K40c (GPU1) -> Tesla K40c (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Checking GPU0 and GPU1 for UVA capabilities...
> Tesla K40c (GPU0) supports UVA: Yes
> Tesla K40c (GPU1) supports UVA: Yes
Both GPUs can support UVA, enabling...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
The program freezes at "Creating event handles...". If I run nvidia-smi at that moment from another SSH session, I get:
[user]$ nvidia-smi
Wed Dec 10 13:33:43 2014
+------------------------------------------------------+
| NVIDIA-SMI 340.32     Driver Version: 340.32         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40c          Off  | 0000:01:00.0     Off |                  Off |
| 28%   58C    P0    62W / 235W |    162MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40c          Off  | 0000:04:00.0     Off |                  Off |
| 29%   60C    P0    64W / 235W |    162MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0     11181  ./simpleP2P                                          136MiB |
|    1     11181  ./simpleP2P                                          136MiB |
+-----------------------------------------------------------------------------+
We can see that the process appears on both GPUs, but neither of them is doing any work.
Finally, this is the last part of the strace output, where it hangs (requested by a member of Stack Overflow):
15:46:22.352233 write(1, "Checking GPU0 and GPU1 for UVA c"..., 47Checking GPU0 and GPU1 for UVA capabilities...
) = 47 <0.000025>
15:46:22.352295 write(1, "> Tesla K40c (GPU0) supports UVA"..., 38> Tesla K40c (GPU0) supports UVA: Yes
) = 38 <0.000021>
15:46:22.352351 write(1, "> Tesla K40c (GPU1) supports UVA"..., 38> Tesla K40c (GPU1) supports UVA: Yes
) = 38 <0.000022>
15:46:22.352408 write(1, "Both GPUs can support UVA, enabl"..., 39Both GPUs can support UVA, enabling...
) = 39 <0.000021>
15:46:22.352468 write(1, "Allocating buffers (64MB on GPU0"..., 56Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
) = 56 <0.000023>
15:46:22.352548 ioctl(3, 0xc0a0464a, 0x7fff5e713cd0) = 0 <0.000075>
15:46:22.352658 ioctl(3, 0xc0384657, 0x7fff5e713c20) = 0 <0.000109>
15:46:22.352799 ioctl(3, 0xc020462a, 0x7fff5e713c50) = 0 <0.000036>
15:46:22.352895 ioctl(3, 0xc0384657, 0x7fff5e713ab0) = 0 <0.000115>
15:46:22.353059 ioctl(3, 0xc0a0464a, 0x7fff5e713cd0) = 0 <0.000070>
15:46:22.353163 ioctl(3, 0xc0384657, 0x7fff5e713c20) = 0 <0.000114>
15:46:22.353309 ioctl(3, 0xc020462a, 0x7fff5e713c50) = 0 <0.000035>
15:46:22.353389 ioctl(3, 0xc0384657, 0x7fff5e713ab0) = 0 <0.000139>
15:46:22.353577 mmap(0x208a00000, 67108864, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x208a00000 <0.000030>
15:46:22.353641 ioctl(8, 0xc0304627, 0x7fff5e713b20) = 0 <0.032102>
15:46:22.385842 ioctl(3, 0xc0384657, 0x7fff5e713bb0) = 0 <0.001215>
15:46:22.387187 ioctl(3, 0xc01c4634, 0x7fff5e7139f0) = 0 <0.000057>
15:46:22.387319 ioctl(3, 0xc0384657, 0x7fff5e713a40) = 0 <0.001193>
15:46:22.388604 write(1, "Creating event handles...\n", 26Creating event handles...
) = 26 <0.000033>
15:46:22.388949 futex(0x2456f68, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable) <0.000017>
15:46:22.389019 futex(0x2456f68, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000012>
15:46:22.389684 write(26, "53", 1) = 1 <0.000035>
15:46:22.389804 futex(0x7fff5e714040, FUTEX_WAIT_PRIVATE, 0, NULL
The strace output stops right at that last line: no further execution, no more messages.
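For what it's worth, the step where it freezes seems to correspond to this part of simpleP2P.cu (paraphrased from the sample source as I understand it, so treat it as a sketch rather than an exact copy; g0, g1 and buf_size are the sample's device buffers and buffer size):

// After printing "Creating event handles...", the sample times a loop of
// device-to-device copies between the two GPUs using cudaMemcpyAsync with
// UVA pointers (the runtime resolves which GPU owns each pointer).
cudaEvent_t start_event, stop_event;
int eventflags = cudaEventBlockingSync;
cudaEventCreateWithFlags(&start_event, eventflags);
cudaEventCreateWithFlags(&stop_event, eventflags);

cudaEventRecord(start_event, 0);
for (int i = 0; i < 100; i++)
{
    // Ping-pong the 64MB buffer: GPU0 -> GPU1 on even iterations,
    // GPU1 -> GPU0 on odd ones.
    if (i % 2 == 0)
        cudaMemcpyAsync(g1, g0, buf_size, cudaMemcpyDefault, 0);
    else
        cudaMemcpyAsync(g0, g1, buf_size, cudaMemcpyDefault, 0);
}
cudaEventRecord(stop_event, 0);
cudaEventSynchronize(stop_event);   // my guess is that the futex wait above is this
                                    // blocking sync, but I have not confirmed it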
Any ideas? Could someone try running the ./simpleP2P sample on Arch Linux with multiple GPUs and see if it works?