cudaMemcpyPeer fails with error 11 (invalid argument)

I’m writing MEX functions that attempt to copy gpuArrays from one GPU (a Titan) to another (a Titan Black). cudaGetLastError returns success when copying from GPU 0 to GPU 0, but returns “invalid argument” when copying from GPU 1 to GPU 0. The memory pointers, device indices, and buffer size all appear to be valid.

From what I have read, it seems this should work. Not so?

Am I missing an initialize command to enable this?

You may be missing some necessary setup code, such as cudaDeviceCanAccessPeer and cudaDeviceEnablePeerAccess.

review the documentation for peer access:

http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__PEER.html#group__CUDART__PEER

and the sample code:

http://docs.nvidia.com/cuda/cuda-samples/index.html#simple-peer-to-peer-transfers-with-multi-gpu

If peer access does not work for you due to system issues (cudaDeviceCanAccessPeer will indicate this), you should still be able to do an ordinary device-to-device transfer.
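For reference, the setup sequence might look like the following minimal sketch (untested here; the device numbers and abbreviated error handling are assumptions):

```cpp
// Sketch: check and enable peer access in both directions before P2P copies.
#include <cuda_runtime.h>
#include <cstdio>

void setupPeerAccess(int dev0, int dev1)
{
    int canAccess01 = 0, canAccess10 = 0;
    cudaDeviceCanAccessPeer(&canAccess01, dev0, dev1);
    cudaDeviceCanAccessPeer(&canAccess10, dev1, dev0);

    if (canAccess01 && canAccess10) {
        // Peer access is enabled per-direction, from the current device.
        cudaSetDevice(dev0);
        cudaDeviceEnablePeerAccess(dev1, 0);  // flags must be 0
        cudaSetDevice(dev1);
        cudaDeviceEnablePeerAccess(dev0, 0);
        printf("P2P enabled between devices %d and %d\n", dev0, dev1);
    } else {
        printf("P2P not available between devices %d and %d\n", dev0, dev1);
    }
}
```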

Thanks for responding, txbob. I previously tried cudaDeviceCanAccessPeer (it returned false) and cudaDeviceEnablePeerAccess (it failed with “device unsupported”).

From what I read, those are for UVA access. No?

The docs I read said that if UVA access is not available, you have to use cudaMemcpyPeer, which I took to mean that cudaMemcpyPeer is always available. No?

Does cudaEnablePeerAccess have to be called to use cudaMemcpyPeer?

I may be confused, but I’m under the impression that UVA access only works with TCC drivers (in Windows). True?

Since cudaMemcpyPeer is failing, it appears my only option is to copy to host and back manually. Any other possibilities?

On windows, UVA requires 64-bit and TCC.

cudaMemcpyPeer can be used even when peer access is not enabled (and it should not depend on UVA, AFAIK). In that case, a fallback copy scheme is used under the hood, staging the copy through a temporary buffer in system memory. I have just tested this in a Linux environment with multiple GPUs where P2P is not enabled (and not possible). At the moment, however, I’m unable to run the same test on a Windows system with multiple GPUs, as my office/lab environment is undergoing construction right now.
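To illustrate the claim above: a call along these lines should succeed even when peer access was never enabled, with the runtime staging through system memory behind the scenes (a sketch, not tested on Windows; buffer names and size are assumptions):

```cpp
// Sketch: cudaMemcpyPeer without cudaDeviceEnablePeerAccess.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t nbytes = 1 << 20;  // 1 MiB, arbitrary

    cudaSetDevice(1);
    float *src = nullptr;
    cudaMalloc(&src, nbytes);       // source buffer on device 1

    cudaSetDevice(0);
    float *dst = nullptr;
    cudaMalloc(&dst, nbytes);       // destination buffer on device 0

    // dst on device 0, src on device 1; no peer access enabled.
    cudaError_t err = cudaMemcpyPeer(dst, 0, src, 1, nbytes);
    printf("cudaMemcpyPeer: %s\n", cudaGetErrorString(err));

    cudaFree(dst);
    cudaSetDevice(1);
    cudaFree(src);
    return 0;
}
```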

Thanks for your feedback txbob.

Today I tried cudaMemcpy to the host and back.

In the MEX function, it seems MATLAB prevents access to GPU 1 memory when GPU 0 is the default, because cudaMemcpy fails to copy from GPU 1 to the host even though cudaSetDevice(1) succeeds.

I can copy a buffer cudaMalloc’ed by a MATLAB process on a different GPU, but I cannot copy a gpuArray created by a different MATLAB process.
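The manual host-staged copy I’m attempting looks roughly like this (a sketch; d_src, d_dst, and nbytes are placeholder names for the gpuArray pointers and size):

```cpp
// Sketch: stage a GPU-1 -> GPU-0 copy through pinned host memory.
#include <cuda_runtime.h>

void stagedCopy(float *d_dst /* on GPU 0 */,
                const float *d_src /* on GPU 1 */,
                size_t nbytes)
{
    void *h_stage = nullptr;
    cudaMallocHost(&h_stage, nbytes);  // pinned host buffer

    cudaSetDevice(1);
    cudaMemcpy(h_stage, d_src, nbytes, cudaMemcpyDeviceToHost);

    cudaSetDevice(0);
    cudaMemcpy(d_dst, h_stage, nbytes, cudaMemcpyHostToDevice);

    cudaFreeHost(h_stage);
}
```

The first cudaMemcpy here is the step that fails in my case when the source buffer belongs to another process.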

Could there be some sort of process specific memory access protection imposed by Matlab?

ETA: MathWorks says that this protection is imposed by CUDA, not by them.

Does anyone know if GPUDirect with UVA works in this case?

I admit that I don’t quite understand your exact scenario or what you are trying to accomplish, but generally one of the things operating systems enforce is that each process accesses only the memory it owns, not the memory of another process. Threads within a process can share the memory owned by that process.

I am trying to consolidate the results of several Matlab processes, each working on a separate gpu.

Mechanisms like named shared memory in Windows allow different processes to read/write the same memory.

I’m just wondering if anything like that is available for gpu memory in CUDA.

If I understand you correctly, GPUDirect & UVA only work with memory allocated by the same process, correct?

Sorry, I have no experience with GPUdirect or UVA.

GPUDirect v2.0/Peer-to-Peer in its ordinary usage requires pointers from the same process. Pointers created in a given process have no relevance in another process. All processes use a virtual address space. The virtual address space of one process is not in any way synchronized with the virtual address space of another process.

You could investigate the CUDA IPC API:

http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1ga682d4aea57e8adb6c72330b78900616

and sample code:

http://docs.nvidia.com/cuda/cuda-samples/index.html#simpleipc
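The IPC flow is roughly the following (a sketch assuming two cooperating processes; the buffer names and the channel used to pass the handle are up to you):

```cpp
// Sketch: sharing a device allocation between two processes via CUDA IPC.
#include <cuda_runtime.h>

// --- Process A: allocate and export a handle ---
void exportBuffer(size_t nbytes, cudaIpcMemHandle_t *handle_out)
{
    float *d_buf = nullptr;
    cudaMalloc(&d_buf, nbytes);
    cudaIpcGetMemHandle(handle_out, d_buf);
    // ...send the handle to process B via any ordinary IPC channel
    //    (pipe, socket, shared file)...
}

// --- Process B: open the handle and use the memory ---
void importBuffer(cudaIpcMemHandle_t handle)
{
    float *d_peer = nullptr;
    cudaIpcOpenMemHandle((void **)&d_peer, handle,
                         cudaIpcMemLazyEnablePeerAccess);
    // ...read/write d_peer as a normal device pointer...
    cudaIpcCloseMemHandle(d_peer);
}
```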


Thanks for the pointer, txbob. Unfortunately I’m stuck with Windows, and those functions seem to require Linux.