cudaMemcpy2DAsync + streams: creating a stream for a 2D array

Hi,
Has anyone tried using streams with cudaMemcpy2DAsync?
All the examples seem to cover only 1D arrays.
Any help would be appreciated, thanks.

Bump…

Has anyone tried this? I’m a bit perplexed by this statement on page 33 in the programming guide (2.1) regarding 2D memory and memcopy overlap:

“This capability is currently supported only for memory copies that do not involve CUDA arrays or 2D arrays allocated through cudaMallocPitch() (see Section 4.5.2.3)…”

Is it really that allocating linear memory via cudaMallocPitch is the problem? Or is it using cudaMemcpy2DAsync? I can easily get around the first problem (it’s not tricky to mimic cudaMallocPitch with cudaMalloc), but I really need the rows in my device arrays to be aligned so that stores can be easily coalesced. So if cudaMemcpy2DAsync is implemented in a way that cannot be overlapped, I may have to reconsider my kernel implementation.

I guess a related question is: if these memcopies cannot be overlapped, is there any advantage to embedding them in a stream?

Thanks all,
David

I tried cudaMemcpy2DAsync for both host to device and device to host copy of “pitched buffers” and it works fine.

That is, the device buffer can be allocated with cudaMallocPitch where width != pitch (e.g. I tried with width = 101) and it works.

The host buffer must be allocated with cudaMallocHost().

Here’s the reference manual on the subject:

cudaMemcpy2DAsync() is asynchronous and can optionally be associated to a stream by passing a non-zero stream argument. It only works on page-locked host memory and returns an error if a pointer to pageable memory is passed as input.
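Putting those two requirements together (pitched device buffer, page-locked host buffer, non-zero stream), a minimal sketch of what was tried might look like the following. Error checking is omitted for brevity, and the sizes are illustrative:

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const size_t width  = 101 * sizeof(float);  /* bytes per row */
    const size_t height = 64;

    float *h_buf, *d_buf;
    size_t pitch;
    cudaMallocHost((void **)&h_buf, width * height);       /* page-locked host memory */
    cudaMallocPitch((void **)&d_buf, &pitch, width, height);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    /* Host rows are packed (host pitch == width); device rows are
       padded out to 'pitch'. Passing the stream makes the copy
       asynchronous with respect to the host. */
    cudaMemcpy2DAsync(d_buf, pitch, h_buf, width,
                      width, height, cudaMemcpyHostToDevice, stream);

    cudaStreamSynchronize(stream);  /* wait before touching the data */

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```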

The hardware resource used to implement concurrent kernel/memcpy execution only supports linear memcpy.

So the observation about 2D arrays allocated with cudaMallocPitch() is one of implementation. The driver support could be written, but has not been. It’s on our (long) list of things to do. Meantime, developers can implement it themselves (albeit slightly less efficiently) in terms of consecutive async memcpy calls.
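The suggested workaround (consecutive async memcpy calls) can be sketched as one linear cudaMemcpyAsync per row, issued into the same stream so each 1D copy remains eligible for overlap with kernel execution. The helper name is hypothetical, and error checking is omitted:

```cuda
#include <cuda_runtime.h>

/* Emulate cudaMemcpy2DAsync with per-row linear copies.
   dst/src carry their own pitches in bytes; src must be
   page-locked host memory for the copies to be asynchronous. */
void memcpy2DAsyncByRows(void *dst, size_t dpitch,
                         const void *src, size_t spitch,
                         size_t width, size_t height,
                         cudaMemcpyKind kind, cudaStream_t stream)
{
    for (size_t row = 0; row < height; ++row) {
        cudaMemcpyAsync((char *)dst + row * dpitch,
                        (const char *)src + row * spitch,
                        width, kind, stream);
    }
}
```

As noted, this is slightly less efficient than a native 2D async copy would be, since it issues height separate transfers.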

The observation about CUDA arrays is a hardware limitation. Copies involving CUDA arrays cannot be performed concurrently with kernel execution.

You mean cudaMemcpyToArrayAsync will not overlap with kernel execution? (in CUDA 2.2)

No, it is not possible; see page 33 of the Programming Guide.