cudaMemcpy2DAsync + streams: creating a stream for a 2D array

Hi,
Has anyone tried using streams with cudaMemcpy2DAsync?
All the examples seem to cover only 1D arrays.
Any help would be appreciated, thanks.

Bump…

Has anyone tried this? I’m a bit perplexed by this statement on page 33 in the programming guide (2.1) regarding 2D memory and memcopy overlap:

“This capability is currently supported only for memory copies that do not involve CUDA arrays or 2D arrays allocated through cudaMallocPitch() (see Section 4.5.2.3)…”

Is it really that allocating linear memory via cudaMallocPitch is the problem? Or is it using cudaMemcpy2DAsync? I can easily get around the first problem (it’s not tricky to mimic cudaMallocPitch with cudaMalloc), but I really need the rows in my device arrays to be aligned so that stores can be easily coalesced. So if cudaMemcpy2DAsync is implemented in a way that cannot be overlapped, I may have to reconsider my kernel implementation.

I guess a related question is: if these memcopies cannot be overlapped, is there any advantage to embedding them in a stream?

Thanks all,
David

I tried cudaMemcpy2DAsync for both host to device and device to host copy of “pitched buffers” and it works fine.

That is, the device buffer can be allocated with cudaMallocPitch where width != pitch (e.g. I tried with width = 101) and it works.

The host buffer must be allocated with cudaMallocHost().

Here’s the reference manual on the subject:

cudaMemcpy2DAsync() is asynchronous and can optionally be associated to a stream by passing a non-zero stream argument. It only works on page-locked host memory and returns an error if a pointer to pageable memory is passed as input.
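Putting those two requirements together (pitched device buffer, page-locked host buffer, non-zero stream), a minimal sketch of what was tried might look like the following. Error checking is omitted for brevity, and the sizes are illustrative:

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const size_t width  = 101 * sizeof(float);  /* bytes per row */
    const size_t height = 64;

    float *h_buf, *d_buf;
    size_t pitch;
    cudaMallocHost((void **)&h_buf, width * height);       /* page-locked host memory */
    cudaMallocPitch((void **)&d_buf, &pitch, width, height);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    /* Host rows are packed (host pitch == width); device rows are
       padded out to 'pitch'. Passing the stream makes the copy
       asynchronous with respect to the host. */
    cudaMemcpy2DAsync(d_buf, pitch, h_buf, width,
                      width, height, cudaMemcpyHostToDevice, stream);

    cudaStreamSynchronize(stream);  /* wait before touching the data */

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```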

The hardware resource used to implement concurrent kernel/memcpy execution only supports linear memcpy.

So the observation about 2D arrays allocated with cudaMallocPitch() is one of implementation. The driver support could be written, but has not been. It’s on our (long) list of things to do. Meantime, developers can implement it themselves (albeit slightly less efficiently) in terms of consecutive async memcpy calls.
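The suggested workaround (consecutive async memcpy calls) can be sketched as one linear cudaMemcpyAsync per row, issued into the same stream so each 1D copy remains eligible for overlap with kernel execution. The helper name is hypothetical, and error checking is omitted:

```cuda
#include <cuda_runtime.h>

/* Emulate cudaMemcpy2DAsync with per-row linear copies.
   dst/src carry their own pitches in bytes; src must be
   page-locked host memory for the copies to be asynchronous. */
void memcpy2DAsyncByRows(void *dst, size_t dpitch,
                         const void *src, size_t spitch,
                         size_t width, size_t height,
                         cudaMemcpyKind kind, cudaStream_t stream)
{
    for (size_t row = 0; row < height; ++row) {
        cudaMemcpyAsync((char *)dst + row * dpitch,
                        (const char *)src + row * spitch,
                        width, kind, stream);
    }
}
```

As noted, this is slightly less efficient than a native 2D async copy would be, since it issues height separate transfers.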

The observation about CUDA arrays is a hardware limitation. Copies involving CUDA arrays cannot be performed concurrently with kernel execution.

You mean cudaMemcpyToArrayAsync will not overlap with kernel execution? (in CUDA 2.2)

No, it is not possible; see page 33 of the Programming Guide.