cudaMemcpyPeer fails with error 11 (invalid argument)

I’m writing MEX functions that attempt to copy gpuArrays from one GPU (a Titan) to another (a Titan Black). cudaGetLastError returns success when copying from GPU 0 to GPU 0, but returns “invalid argument” when copying from GPU 1 to GPU 0. The memory pointers, device indices, and buffer size all appear to be valid.

From what I have read, it seems this should work. Not so?

Am I missing an initialize command to enable this?

You may be missing some necessary setup code, such as cudaDeviceCanAccessPeer and cudaDeviceEnablePeerAccess.

review the documentation for peer access:

http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__PEER.html#group__CUDART__PEER

and the sample code:

http://docs.nvidia.com/cuda/cuda-samples/index.html#simple-peer-to-peer-transfers-with-multi-gpu

If peer access does not work for you due to system issues (cudaDeviceCanAccessPeer will indicate this), you should still be able to do an ordinary device-to-device transfer.
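For reference, the setup sequence might look like the following minimal sketch (untested here; the device numbers and abbreviated error handling are assumptions):

```cpp
// Sketch: check and enable peer access in both directions before P2P copies.
#include <cuda_runtime.h>
#include <cstdio>

void setupPeerAccess(int dev0, int dev1)
{
    int canAccess01 = 0, canAccess10 = 0;
    cudaDeviceCanAccessPeer(&canAccess01, dev0, dev1);
    cudaDeviceCanAccessPeer(&canAccess10, dev1, dev0);

    if (canAccess01 && canAccess10) {
        // Peer access is enabled per-direction, from the current device.
        cudaSetDevice(dev0);
        cudaDeviceEnablePeerAccess(dev1, 0);  // flags must be 0
        cudaSetDevice(dev1);
        cudaDeviceEnablePeerAccess(dev0, 0);
        printf("P2P enabled between devices %d and %d\n", dev0, dev1);
    } else {
        printf("P2P not available between devices %d and %d\n", dev0, dev1);
    }
}
```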

Thanks for responding, txbob. I previously tried cudaDeviceCanAccessPeer (it returned false) and cudaDeviceEnablePeerAccess (it failed with “device unsupported”).

From what I read, those are for UVA access. No?

The docs I read said that if UVA access is not available, you have to use cudaMemcpyPeer, which I took to mean that cudaMemcpyPeer is always available. No?

Does cudaEnablePeerAccess have to be called to use cudaMemcpyPeer?

I may be confused, but I’m under the impression that UVA access only works with TCC drivers (in Windows). True?

Since cudaMemcpyPeer is failing, it appears my only option is to copy to host and back manually. Any other possibilities?

On windows, UVA requires 64-bit and TCC.

cudaMemcpyPeer can be used even when peer access is not enabled (and it should not depend on UVA, AFAIK). In that case, a fallback copy scheme is used under the hood, staging the copy through a temporary buffer in system memory. I have just tested this in a Linux environment with multiple GPUs where P2P is not enabled (and not possible). At the moment, however, I’m unable to run the same test on a Windows system with multiple GPUs, as my office/lab environment is undergoing construction right now.
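To illustrate the claim above: a call along these lines should succeed even when peer access was never enabled, with the runtime staging through system memory behind the scenes (a sketch, not tested on Windows; buffer names and size are assumptions):

```cpp
// Sketch: cudaMemcpyPeer without cudaDeviceEnablePeerAccess.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t nbytes = 1 << 20;  // 1 MiB, arbitrary

    cudaSetDevice(1);
    float *src = nullptr;
    cudaMalloc(&src, nbytes);       // source buffer on device 1

    cudaSetDevice(0);
    float *dst = nullptr;
    cudaMalloc(&dst, nbytes);       // destination buffer on device 0

    // dst on device 0, src on device 1; no peer access enabled.
    cudaError_t err = cudaMemcpyPeer(dst, 0, src, 1, nbytes);
    printf("cudaMemcpyPeer: %s\n", cudaGetErrorString(err));

    cudaFree(dst);
    cudaSetDevice(1);
    cudaFree(src);
    return 0;
}
```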

Thanks for your feedback txbob.

Today I tried cudaMemcpy to the host and back.

In the MEX function, it seems MATLAB prevents access to GPU 1 memory when GPU 0 is the default, because cudaMemcpy fails to copy from GPU 1 to the host even though cudaSetDevice(1) succeeds.

I can copy a buffer cudaMalloc’ed by a MATLAB process on a different GPU, but I cannot copy a gpuArray created by a different MATLAB process.
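The manual host-staged copy I’m attempting looks roughly like this (a sketch; d_src, d_dst, and nbytes are placeholder names for the gpuArray pointers and size):

```cpp
// Sketch: stage a GPU-1 -> GPU-0 copy through pinned host memory.
#include <cuda_runtime.h>

void stagedCopy(float *d_dst /* on GPU 0 */,
                const float *d_src /* on GPU 1 */,
                size_t nbytes)
{
    void *h_stage = nullptr;
    cudaMallocHost(&h_stage, nbytes);  // pinned host buffer

    cudaSetDevice(1);
    cudaMemcpy(h_stage, d_src, nbytes, cudaMemcpyDeviceToHost);

    cudaSetDevice(0);
    cudaMemcpy(d_dst, h_stage, nbytes, cudaMemcpyHostToDevice);

    cudaFreeHost(h_stage);
}
```

The first cudaMemcpy here is the step that fails in my case when the source buffer belongs to another process.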

Could there be some sort of process specific memory access protection imposed by Matlab?

ETA: MathWorks says that this protection is imposed by CUDA, not by them.

Does anyone know if GPUDirect with UVA works in this case?

I admit that I don’t quite understand your exact scenario or what you are trying to accomplish, but generally one of the things operating systems enforce is that each process accesses only the memory it owns, not the memory of another process. Threads within a process can share the memory owned by that process.

I am trying to consolidate the results of several Matlab processes, each working on a separate gpu.

Mechanisms like named shared memory in Windows allow different processes to read/write the same memory.

I’m just wondering if anything like that is available for gpu memory in CUDA.

If I understand you correctly, GPUDirect & UVA only work with memory allocated by the same process, correct?

Sorry, I have no experience with GPUdirect or UVA.

GPUDirect v2.0/Peer-to-Peer in its ordinary usage requires pointers from the same process. Pointers created in a given process have no relevance in another process. All processes use a virtual address space. The virtual address space of one process is not in any way synchronized with the virtual address space of another process.

You could investigate the CUDA IPC API:

http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1ga682d4aea57e8adb6c72330b78900616

and sample code:

http://docs.nvidia.com/cuda/cuda-samples/index.html#simpleipc
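The IPC flow is roughly the following (a sketch assuming two cooperating processes; the buffer names and the channel used to pass the handle are up to you):

```cpp
// Sketch: sharing a device allocation between two processes via CUDA IPC.
#include <cuda_runtime.h>

// --- Process A: allocate and export a handle ---
void exportBuffer(size_t nbytes, cudaIpcMemHandle_t *handle_out)
{
    float *d_buf = nullptr;
    cudaMalloc(&d_buf, nbytes);
    cudaIpcGetMemHandle(handle_out, d_buf);
    // ...send the handle to process B via any ordinary IPC channel
    //    (pipe, socket, shared file)...
}

// --- Process B: open the handle and use the memory ---
void importBuffer(cudaIpcMemHandle_t handle)
{
    float *d_peer = nullptr;
    cudaIpcOpenMemHandle((void **)&d_peer, handle,
                         cudaIpcMemLazyEnablePeerAccess);
    // ...read/write d_peer as a normal device pointer...
    cudaIpcCloseMemHandle(d_peer);
}
```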


Thanks for the pointer, txbob. Unfortunately I’m stuck with Windows, and those functions seem to require Linux.