Could it be that memory allocated with cudaMalloc3D and addressed through a cudaPitchedPtr cannot be used when it resides on another card (peer-to-peer access from within a kernel)?
If this is supposed to work, I can post some code to reproduce the problem.
I have a kernel that works perfectly when both memory areas reside on the same card but gives an “unspecified launch failure” as soon as they reside on different cards.
Windows Server 2008 R2
4 Tesla 2050 cards
Peer Access checked and enabled.
TCC driver version 276.14
Toolkit version 4.0.17 win64
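The access pattern described above can be sketched roughly as follows. This is not the actual failing code, only a minimal illustration under stated assumptions: the kernel name fillPitched, the driver function runRepro and the 64 x 32 x 8 float volume are all hypothetical. The memory is allocated with cudaMalloc3D on one device and written by a kernel launched on another.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread writes one (x, y) column through all z slices
// of a pitched 3D allocation that may live on a peer GPU.
__global__ void fillPitched(cudaPitchedPtr p, int width, int height, int depth,
                            float value) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    char* base = static_cast<char*>(p.ptr);
    size_t slicePitch = p.pitch * height;  // bytes per z slice
    for (int z = 0; z < depth; ++z) {
        float* row = reinterpret_cast<float*>(base + z * slicePitch + y * p.pitch);
        row[x] = value;
    }
}

// Hypothetical driver: allocate on memDev, run the kernel on kernelDev.
// Returns the first CUDA error encountered, or cudaSuccess.
cudaError_t runRepro(int kernelDev, int memDev) {
    const int width = 64, height = 32, depth = 8;  // assumed sizes
    cudaPitchedPtr p;

    cudaError_t err = cudaSetDevice(memDev);
    if (err != cudaSuccess) return err;
    err = cudaMalloc3D(&p, make_cudaExtent(width * sizeof(float), height, depth));
    if (err != cudaSuccess) return err;

    err = cudaSetDevice(kernelDev);
    if (err == cudaSuccess && kernelDev != memDev) {
        err = cudaDeviceEnablePeerAccess(memDev, 0);
        if (err == cudaErrorPeerAccessAlreadyEnabled) {
            cudaGetLastError();  // reset the error state; this case is harmless
            err = cudaSuccess;
        }
    }
    if (err == cudaSuccess) {
        dim3 block(16, 16);
        dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
        fillPitched<<<grid, block>>>(p, width, height, depth, 1.0f);
        err = cudaGetLastError();                               // launch errors
        if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
    }

    // Best-effort cleanup on the device that owns the allocation.
    cudaSetDevice(memDev);
    cudaFree(p.ptr);
    return err;
}
```

Calling runRepro(0, 0) keeps both the kernel and the memory on one card, while runRepro(0, 1) exercises the peer-to-peer path; comparing the two return values with cudaGetErrorString narrows down where the failure occurs.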
I checked with our driver team and have been advised that this should work provided cudaDeviceEnablePeerAccess() was called to enable P2P. I assume you are religiously checking the return status of all CUDA API calls leading up to the failing kernel call? If after careful review of the code you have concluded that this is likely a CUDA issue, I would suggest filing a bug, attaching a self-contained repro case (smaller is better).
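As a rough illustration of the status checking suggested above, here is a minimal sketch. The macro CUDA_CHECK and the helpers enablePeer and checkLastLaunch are hypothetical names, not part of the CUDA API; only the cuda* calls themselves are runtime functions.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s failed: %s\n", __FILE__,  \
                         __LINE__, #call, cudaGetErrorString(err_));  \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical helper: let kernels running on device `src` access memory on
// device `dst`. Returns false if the runtime reports that P2P is not possible.
bool enablePeer(int src, int dst) {
    int canAccess = 0;
    CUDA_CHECK(cudaDeviceCanAccessPeer(&canAccess, src, dst));
    if (!canAccess) return false;

    CUDA_CHECK(cudaSetDevice(src));
    cudaError_t err = cudaDeviceEnablePeerAccess(dst, 0);
    if (err == cudaErrorPeerAccessAlreadyEnabled) {
        cudaGetLastError();  // reset the error state; already enabled is fine
        return true;
    }
    CUDA_CHECK(err);
    return true;
}

// Hypothetical helper: call right after a kernel launch to surface both
// launch-time and execution-time errors.
void checkLastLaunch() {
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());
}
```

Peer access is directional, so enablePeer(0, 1) and enablePeer(1, 0) are separate steps. A false return for the pair actually used by the kernel would be consistent with an “unspecified launch failure” on the first remote access.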
It seems I have figured out part of the problem, which might also be a driver problem.
Actually, I did not have peer access enabled between all devices (which explains the error).
The reason, however, was that driver version 270.81 correctly(!) reported that peer access was only possible between GPU0 <-> GPU1 and GPU2 <-> GPU3.
After updating the driver to 276.14, it now falsely(!) reports that full peer access is possible.
However, enabling peer access across the two GPU groups (0,1 and 2,3) and then trying cudaMemcpyPeer or direct access now results in a system crash that requires a reboot.
It’s a special system configuration we bought before the NVIDIA 1U solutions were available (sigh).
This might also be related to another problem I had with cudaMemcpy3DPeer when peer access was not enabled (see topic 213131, “problem with cudaMemcpy3DPeer”).
P2P works only for GPUs under the same PCI-e root. On Linux, you can see the PCI-e tree with “lspci -t”; I am not familiar with Windows, but there should be something equivalent.
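From inside a CUDA program, a rough way to compare what the driver reports against the physical layout is to print each GPU's PCI location next to the peer-access matrix. The sketch below assumes a hypothetical helper named printPeerTopology. Note that the bus and device numbers only locate each card; they do not by themselves say which PCI-e root a card hangs off, so they still need to be matched against the system's PCI-e tree.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print each GPU's PCI location and the peer-access
// matrix as reported by the runtime. Returns the device count, or -1 on error.
int printPeerTopology() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess) return -1;

    for (int i = 0; i < n; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess) return -1;
        std::printf("GPU%d: %s, PCI bus %d, device %d\n",
                    i, prop.name, prop.pciBusID, prop.pciDeviceID);
    }

    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            if (cudaDeviceCanAccessPeer(&canAccess, i, j) != cudaSuccess) return -1;
            std::printf("GPU%d -> GPU%d: %s\n", i, j,
                        canAccess ? "peer access reported" : "no peer access");
        }
    }
    return n;
}
```

If the matrix claims peer access between two GPUs that sit under different PCI-e roots, that would match the behaviour described for the newer driver earlier in this thread.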