I’m writing MEX functions which attempt to copy gpuArrays from one GPU (Titan) to another (Titan Black). After calling cudaMemcpyPeer, cudaGetLastError returns success when copying from GPU 0 to GPU 0, but returns “invalid argument” when copying from GPU 1 to GPU 0. Both memory pointers, the device indices, and the buffer size appear to be valid.
From what I have read, it seems this should work. Not so?
Am I missing an initialize command to enable this?
If peer access does not work for you due to system issues (cudaDeviceCanAccessPeer will indicate this), you should still be able to do an ordinary device-to-device transfer.
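A minimal sketch of that check, assuming the same device numbering as in the question (GPU 0 as destination, GPU 1 as source). The `ok` helper is hypothetical and exists only for this example; the runtime calls are the standard `cudaDeviceCanAccessPeer` and `cudaDeviceEnablePeerAccess`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper for this sketch: report a CUDA error, return false on failure.
static bool ok(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main() {
    int count = 0;
    if (!ok(cudaGetDeviceCount(&count), "cudaGetDeviceCount")) return 1;
    if (count < 2) {
        std::printf("This sketch needs at least two GPUs (found %d).\n", count);
        return 0;
    }

    const int dstDev = 0;  // assumed: device that will access the peer's memory
    const int srcDev = 1;  // assumed: device that owns the source buffer

    // Ask whether dstDev can directly address memory that lives on srcDev.
    int canAccess = 0;
    if (!ok(cudaDeviceCanAccessPeer(&canAccess, dstDev, srcDev),
            "cudaDeviceCanAccessPeer")) return 1;
    std::printf("GPU %d can access GPU %d directly: %s\n",
                dstDev, srcDev, canAccess ? "yes" : "no");

    if (canAccess) {
        // Peer access is enabled from the current device toward the peer.
        if (!ok(cudaSetDevice(dstDev), "cudaSetDevice")) return 1;
        cudaError_t err = cudaDeviceEnablePeerAccess(srcDev, 0);  // flags must be 0
        if (err == cudaErrorPeerAccessAlreadyEnabled) {
            cudaGetLastError();  // clear the error state; this case is harmless
            err = cudaSuccess;
        }
        if (!ok(err, "cudaDeviceEnablePeerAccess")) return 1;
        std::printf("Peer access enabled from GPU %d to GPU %d.\n", dstDev, srcDev);
    }
    return 0;
}
```

If the check reports "no", direct peer access is not available on that system, and the copy has to take a staged path instead.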
cudaMemcpyPeer can be used even when peer access is not enabled. (It also should not depend on UVA, AFAIK.) In this case, a fallback copy scheme should be used under the hood, staging the copy through a temporary buffer in system memory. I have just tested this in a Linux environment with multiple GPUs where P2P is not enabled (and not possible). However, at the moment I’m unable to run the same test on a Windows system with multiple GPUs, as my office/lab environment is undergoing construction right now.
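For reference, a standalone test along those lines might look like the sketch below. It is not the exact test described above; the buffer size, the device indices (1 as source, 0 as destination), and the `CHECK` macro are assumptions made for illustration. It never calls `cudaDeviceEnablePeerAccess`, so it exercises `cudaMemcpyPeer` without peer access being explicitly enabled.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical error-checking macro for this sketch: abort on any CUDA error.
#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "%s:%d: %s -> %s\n", __FILE__, __LINE__,  \
                         #call, cudaGetErrorString(err_));                 \
            std::exit(EXIT_FAILURE);                                       \
        }                                                                  \
    } while (0)

int main() {
    int count = 0;
    CHECK(cudaGetDeviceCount(&count));
    if (count < 2) {
        std::printf("This sketch needs at least two GPUs (found %d).\n", count);
        return 0;
    }

    const int srcDev = 1, dstDev = 0;  // assumed device indices
    const size_t n = 1 << 20;          // assumed element count
    const size_t bytes = n * sizeof(float);

    std::vector<float> hostIn(n), hostOut(n, 0.0f);
    for (size_t i = 0; i < n; ++i) hostIn[i] = static_cast<float>(i);

    // Allocate and fill the source buffer on srcDev.
    float* dSrc = nullptr;
    CHECK(cudaSetDevice(srcDev));
    CHECK(cudaMalloc(reinterpret_cast<void**>(&dSrc), bytes));
    CHECK(cudaMemcpy(dSrc, hostIn.data(), bytes, cudaMemcpyHostToDevice));

    // Allocate the destination buffer on dstDev.
    float* dDst = nullptr;
    CHECK(cudaSetDevice(dstDev));
    CHECK(cudaMalloc(reinterpret_cast<void**>(&dDst), bytes));

    // Cross-device copy; peer access has not been enabled in this program.
    CHECK(cudaMemcpyPeer(dDst, dstDev, dSrc, srcDev, bytes));

    // Read the result back from dstDev and compare with the original data.
    CHECK(cudaMemcpy(hostOut.data(), dDst, bytes, cudaMemcpyDeviceToHost));
    size_t mismatches = 0;
    for (size_t i = 0; i < n; ++i) {
        if (hostOut[i] != hostIn[i]) ++mismatches;
    }
    std::printf("cudaMemcpyPeer %d -> %d: %zu mismatches\n",
                srcDev, dstDev, mismatches);

    CHECK(cudaFree(dDst));
    CHECK(cudaSetDevice(srcDev));
    CHECK(cudaFree(dSrc));
    return mismatches == 0 ? 0 : 1;
}
```

Both buffers here are allocated by the same process that performs the copy. If this program succeeds on a given machine but the MEX version fails, the difference is more likely in how the pointers were obtained than in cudaMemcpyPeer itself.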
In the MEX function, it seems MATLAB prevents access to GPU 1 memory when GPU 0 is the default, because cudaMemcpy fails to copy from GPU 1 to the host, even though cudaSetDevice(1) succeeds.
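One way to narrow this down is to ask the runtime which device it associates with a given pointer. The sketch below uses `cudaPointerGetAttributes` for that; `describePointer` is a hypothetical helper, and the `type` field assumes CUDA 10 or newer (older toolkits named it `memoryType`). In a MEX function the same query could be applied to the raw pointer taken from the gpuArray, which is not shown here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print what the runtime knows about pointer p.
static void describePointer(const char* label, const void* p) {
    cudaPointerAttributes attr;
    cudaError_t err = cudaPointerGetAttributes(&attr, p);
    if (err != cudaSuccess) {
        // Older runtimes report pointers they do not know about as an error.
        std::printf("%s: not known to this process (%s)\n",
                    label, cudaGetErrorString(err));
        cudaGetLastError();  // clear the error state
        return;
    }
    const char* kind = "unregistered";
    if (attr.type == cudaMemoryTypeDevice)       kind = "device";
    else if (attr.type == cudaMemoryTypeHost)    kind = "pinned host";
    else if (attr.type == cudaMemoryTypeManaged) kind = "managed";
    std::printf("%s: %s memory, device %d\n", label, kind, attr.device);
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("No CUDA devices available.\n");
        return 0;
    }

    // Allocate a small buffer on each device and ask where it lives.
    for (int dev = 0; dev < count; ++dev) {
        void* buf = nullptr;
        if (cudaSetDevice(dev) != cudaSuccess ||
            cudaMalloc(&buf, 256) != cudaSuccess) {
            std::printf("GPU %d: allocation failed\n", dev);
            cudaGetLastError();  // clear the error state
            continue;
        }
        char label[32];
        std::snprintf(label, sizeof(label), "buffer on GPU %d", dev);
        describePointer(label, buf);
        cudaFree(buf);
    }

    // An ordinary host variable is not device memory.
    int hostValue = 0;
    describePointer("host stack variable", &hostValue);
    return 0;
}
```

If a pointer that is supposed to live on GPU 1 is not reported as device memory on device 1, the copy is being handed an address the current process cannot use, which would be consistent with an "invalid argument" error.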
I admit that I don’t quite understand your exact scenario or what you are trying to accomplish. Generally, though, one of the things operating systems enforce is that each process accesses only the memory it owns, not the memory of another process. Threads within a process can share the memory owned by that process.
GPUDirect v2.0/Peer-to-Peer in its ordinary usage requires pointers from the same process. Pointers created in a given process have no relevance in another process. All processes use a virtual address space. The virtual address space of one process is not in any way synchronized with the virtual address space of another process.