Using cuMemcpy2DAsync and CUDA arrays

According to the user guide it is not possible to perform host → device copies at the same time as kernel execution if the transfer involves CUDA arrays. Are there any known workarounds? Is it likely to be supported in the future? I really need to keep the bilinear filtering and texture caching.

No. A kernel cannot be overlapped with a memcopy involving a cudaArray. cudaMemcpyToArrayAsync does let the CPU thread continue working while the copy is taking place.
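
A minimal sketch of that pattern, with illustrative sizes and buffer names (W, H, h_src are not from the original question): the async call returns control to the CPU immediately, even though the copy itself still will not overlap with kernel execution on the GPU.

#include <cuda_runtime.h>

#define W 256
#define H 256

int main(void)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray *d_array;
    cudaMallocArray(&d_array, &desc, W, H);

    // Page-locked host memory is required for the copy to be truly asynchronous.
    float *h_src;
    cudaMallocHost((void**)&h_src, W * H * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Returns immediately; the DMA transfer proceeds while the CPU keeps working.
    cudaMemcpyToArrayAsync(d_array, 0, 0, h_src, W * H * sizeof(float),
                           cudaMemcpyHostToDevice, stream);

    /* ... other CPU work can run here ... */

    cudaStreamSynchronize(stream);
    return 0;
}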

Paulius

It's not possible to do a host → device copy using linear memory and then a device → device copy into a CUDA array, then?

It is possible. But a device->device memcopy of any kind cannot be overlapped with kernel execution, though it's much faster than a host->device copy due to higher bandwidth.
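
For illustration, a sketch of that staging approach under assumed sizes and names (W, H, d_staging are hypothetical): the host → device leg goes into linear device memory asynchronously, and a device → device copy then fills the cudaArray.

#include <cuda_runtime.h>

#define W 256
#define H 256

int main(void)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray *d_array;
    cudaMallocArray(&d_array, &desc, W, H);

    float *h_src, *d_staging;
    cudaMallocHost((void**)&h_src, W * H * sizeof(float));  // pinned host buffer
    cudaMalloc((void**)&d_staging, W * H * sizeof(float));  // linear staging buffer

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Host -> device into linear memory: asynchronous with respect to the host.
    cudaMemcpyAsync(d_staging, h_src, W * H * sizeof(float),
                    cudaMemcpyHostToDevice, stream);

    // Device -> device into the cudaArray: very fast, but as noted above it
    // will not overlap with kernel execution.
    cudaMemcpyToArrayAsync(d_array, 0, 0, d_staging, W * H * sizeof(float),
                           cudaMemcpyDeviceToDevice, stream);

    cudaStreamSynchronize(stream);
    return 0;
}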

Paulius

From the documentation describing cudaMemcpyToArrayAsync():

“The copy can optionally be associated to a stream by passing a non-zero stream argument. If kind is cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost and stream is non-zero, the copy may overlap with operations in other streams.”

I would interpret this to mean that a kernel CAN be overlapped with the cudaArray memcopy, provided that the kernel is operating in a different stream. Can you please confirm? Or will the kernel and memcopy ultimately be serialized on the device? If that is the case, it is unclear to me what is meant by “the copy may overlap with operations in other streams”.

Please help clarify, I am very interested in this.

Thanks.

The documentation at the link you include states: “IMPORTANT NOTE: Copies with kind == cudaMemcpyDeviceToDevice are asynchronous with respect to the host, but never overlap with kernel execution.” The wording regarding overlap with operations in other streams isn’t quite correct - you cannot overlap a kernel execution and a copy to an array. Thanks for catching it.

Paulius

I was performing the copy with kind == cudaMemcpyHostToDevice, not cudaMemcpyDeviceToDevice, so it was not clear to me that the note above applied to me. Am I somehow triggering a device → device mem copy that I am unaware of, even though I specify cudaMemcpyHostToDevice?

My desire, ultimately, is to bind to a 2D texture and use its interpolation filtering. The latest CUDA version (2.2) allows 2D textures to be mapped to pitch linear memory. Can I use pitch linear memory instead of a CUDA array to get around this async overlapping limitation?

Yes, this will work, but you can’t update the texture via a streamed 2D host->device memcpy - that will break concurrency. Instead, you have to do it with 1D async memcpy(s), with a bandwidth/CPU overhead tradeoff. You can update the whole texture with a single async copy of Pitch*Height bytes, transferring (Pitch-WidthInBytes)*Height more bytes than is necessary. But if you break the copy into Height memcpy’s of size WidthInBytes, the CPU overhead of all those tiny memcpy’s may overwhelm the bandwidth advantage.
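
To make the tradeoff concrete, here is a sketch of the single-copy option with assumed dimensions and names (WIDTH, HEIGHT, h_src and the kernel are illustrative): the pinned host buffer is laid out with the same pitch as the device allocation, so one contiguous async copy of Pitch*Height bytes refreshes the whole texture.

#include <cuda_runtime.h>

#define WIDTH  640
#define HEIGHT 480

// 2D texture reference bound to pitch-linear memory; bilinear filtering still applies.
texture<float, 2, cudaReadModeElementType> tex;

__global__ void fetchKernel(float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        out[y * w + x] = tex2D(tex, x + 0.5f, y + 0.5f);  // filtered fetch
}

int main(void)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();

    float *d_tex;
    size_t pitch, offset;
    cudaMallocPitch((void**)&d_tex, &pitch, WIDTH * sizeof(float), HEIGHT);

    tex.filterMode = cudaFilterModeLinear;
    cudaBindTexture2D(&offset, tex, d_tex, desc, WIDTH, HEIGHT, pitch);

    // Pinned host buffer laid out with the device pitch, so the whole texture
    // can be updated with a single contiguous async copy of pitch*HEIGHT bytes
    // (at the cost of (pitch - WIDTH*sizeof(float))*HEIGHT extra bytes).
    float *h_src;
    cudaMallocHost((void**)&h_src, pitch * HEIGHT);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(d_tex, h_src, pitch * HEIGHT,
                    cudaMemcpyHostToDevice, stream);     // 1D copy: keeps concurrency

    float *d_out;
    cudaMalloc((void**)&d_out, WIDTH * HEIGHT * sizeof(float));
    dim3 block(16, 16), grid((WIDTH + 15) / 16, (HEIGHT + 15) / 16);
    fetchKernel<<<grid, block, 0, stream>>>(d_out, WIDTH, HEIGHT);

    cudaStreamSynchronize(stream);
    return 0;
}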

I plan to use two streams, with each kernel operating on its own texture. I would like to perform one host->device copy (for texture #1) overlapped with kernel execution in the other stream, using another texture (texture #2). So there should be no concurrency problem, right?

I just want to overlap a host->device memory copy (bound to a texture) with kernel execution in another stream to achieve the best timing possible. Unfortunately it seems copying to a cudaArray is out of the question, but perhaps using pitch linear memory is just as good.

The only problem is, with my first attempt, mapping pitch linear memory to a 2D texture shows a tremendous penalty for tex2D fetch calls… it takes 100X longer. I don’t believe it, so I must have a bug somewhere.
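
For reference, a minimal sketch of the two-stream scheme described above (sizes, stream names, and the kernel are illustrative, not from the posts): the upload behind tex1 is issued in one stream while a kernel fetching from tex2 runs in the other, so on hardware with a copy engine the copy and the kernel can overlap.

#include <cuda_runtime.h>

#define W 640
#define H 480

texture<float, 2, cudaReadModeElementType> tex1;
texture<float, 2, cudaReadModeElementType> tex2;

__global__ void useTex2(float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        out[y * w + x] = tex2D(tex2, x + 0.5f, y + 0.5f);
}

int main(void)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    float *d_buf1, *d_buf2, *h_buf1, *d_out;
    size_t pitch1, pitch2, offset;

    cudaMallocPitch((void**)&d_buf1, &pitch1, W * sizeof(float), H);
    cudaMallocPitch((void**)&d_buf2, &pitch2, W * sizeof(float), H);
    cudaMallocHost((void**)&h_buf1, pitch1 * H);            // pinned host source for tex1
    cudaMalloc((void**)&d_out, W * H * sizeof(float));

    tex1.filterMode = cudaFilterModeLinear;
    tex2.filterMode = cudaFilterModeLinear;
    cudaBindTexture2D(&offset, tex1, d_buf1, desc, W, H, pitch1);
    cudaBindTexture2D(&offset, tex2, d_buf2, desc, W, H, pitch2);

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Upload the pitch-linear data behind tex1 in stream s1 ...
    cudaMemcpyAsync(d_buf1, h_buf1, pitch1 * H, cudaMemcpyHostToDevice, s1);

    // ... while a kernel reading tex2 executes in stream s2.
    dim3 block(16, 16), grid((W + 15) / 16, (H + 15) / 16);
    useTex2<<<grid, block, 0, s2>>>(d_out, W, H);

    cudaThreadSynchronize();
    return 0;
}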