simple asynchronous memcpy and kernel execution question

cudaprogrammer · March 6, 2010, 4:30am

Forgive me if this question has already been answered, but I searched the posts and did not find anything. If you have the general pseudo code of

kernel1<<<…>>>
memcpyasync()
kernel2<<<…>>>

Will the device serialize these calls? I know kernel calls are sequential (at least until Fermi), but in the above example, will the memcpy complete before kernel2 executes? I have many of these operations and I don’t want the host to waste time waiting for blocking calls to return, if I can help it. If the above is serialized, is there a practical limit to the number of calls one could issue before filling up the queue on the device? Is that documented as part of the device or is it a driver issue that one could characterize?

Thanks in advance for your help.

gonnet · March 6, 2010, 3:51pm

Hi,

If you specify a stream in the kernel1 and kernel2 calls and the async memcpy uses the same stream, the calls will be serialized. Unfortunately, as far as I know, this is not clearly specified in the documentation.

Now consider the case where you do not give a stream when launching the kernels. I don’t know what the behaviour would be (it does not look safe). kernel1 and kernel2 would be serialized, since both use the default stream (0), but I’m not sure we know anything about kernel2 being serialized after the memcpy.

Hope that helps,
Cédric

seibert · March 6, 2010, 3:53pm

As long as all operations are on the same stream, each operation has to finish before the next one starts.
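To make the same-stream case concrete, here is a minimal sketch of the kernel1 / memcpy / kernel2 pattern from the original question. The kernel bodies, buffer names, and sizes are made up for illustration. It assumes pinned host memory (so the copies can actually be asynchronous with respect to the host) and one explicitly created stream. Because everything is issued to that one stream, kernel2 should not start until the copy has finished.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Hypothetical stand-in for kernel1: fills the buffer with its indices.
__global__ void kernel1(int *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = i;
}

// Hypothetical stand-in for kernel2: consumes data delivered by the async copy.
__global__ void kernel2(int *buf, const int *offset, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += offset[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(int);

    // Pinned host buffers; pageable memory can make the copy synchronous.
    int *h_offset, *h_result;
    CHECK(cudaMallocHost(&h_offset, bytes));
    CHECK(cudaMallocHost(&h_result, bytes));
    for (int i = 0; i < n; ++i) h_offset[i] = 1;

    int *d_buf, *d_offset;
    CHECK(cudaMalloc(&d_buf, bytes));
    CHECK(cudaMalloc(&d_offset, bytes));

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    const int block = 256;
    const int grid = (n + block - 1) / block;

    // All four operations go to the same stream, so they run in issue order.
    kernel1<<<grid, block, 0, stream>>>(d_buf, n);
    CHECK(cudaMemcpyAsync(d_offset, h_offset, bytes,
                          cudaMemcpyHostToDevice, stream));
    kernel2<<<grid, block, 0, stream>>>(d_buf, d_offset, n);
    CHECK(cudaMemcpyAsync(h_result, d_buf, bytes,
                          cudaMemcpyDeviceToHost, stream));

    // The host only waits here, after everything has been queued.
    CHECK(cudaStreamSynchronize(stream));
    CHECK(cudaGetLastError());

    // If kernel2 had run before the copy, some elements would not equal i + 1.
    int errors = 0;
    for (int i = 0; i < n; ++i) errors += (h_result[i] != i + 1);
    printf("mismatches: %d\n", errors);

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d_buf));
    CHECK(cudaFree(d_offset));
    CHECK(cudaFreeHost(h_offset));
    CHECK(cudaFreeHost(h_result));
    return errors != 0;
}
```

The host thread is free to do other work between the last cudaMemcpyAsync and the cudaStreamSynchronize call.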

I know that people have experimentally found the queue size to be between 24 and 32 (depending on device, I think) for just kernel calls. I don’t know if that same number holds when asynchronous memory copies are also queued up.

cudaprogrammer · March 6, 2010, 6:18pm

Thanks for the reply guys. A follow up question or two…

I have read that not specifying a stream uses the ‘default’ stream. If this is the case, would an actual stream need to be specified in this scenario (since all of these commands would use the default stream)?

Also, if the device buffer overflows with async commands, what is the system response? If the GPU being used is also the display card, does the system appear to lock up? If not, does the card just become non-responsive or give incorrect kernel results from dropping commands (that overflowed from the buffer)?

Thanks again.

seibert · March 6, 2010, 6:46pm

It’s not nearly that catastrophic. If you queue up more commands than the buffer can hold, then the calls wait until they can be queued up before returning.
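One rough way to see this blocking behaviour, and to characterize the queue depth on a given setup, is to time each launch call on the host while the device is kept busy. The sketch below is only an assumption-laden experiment, not a documented method: the spin kernel, the cycle count, the number of launches, and the 200 microsecond threshold are all arbitrary choices, and the result will likely vary with device and driver.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Busy-wait kernel so that queued launches pile up behind each other.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    const int launches = 2048;          // arbitrary; raise it if nothing blocks
    const long long cycles = 1000000;   // roughly 1 ms per kernel on a ~1 GHz clock
    const double threshold_us = 200.0;  // arbitrary cut-off for "this launch blocked"

    // Warm-up launch so context creation is not counted in the timings.
    spin<<<1, 1>>>(1);
    cudaDeviceSynchronize();

    int first_blocked = -1;
    for (int i = 0; i < launches; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        spin<<<1, 1>>>(cycles);         // returns at once unless the queue is full
        auto t1 = std::chrono::steady_clock::now();
        double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
        if (first_blocked < 0 && us > threshold_us) first_blocked = i;
    }
    cudaDeviceSynchronize();

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    if (first_blocked >= 0)
        printf("launch %d was the first to take longer than %.0f us\n",
               first_blocked, threshold_us);
    else
        printf("no launch exceeded %.0f us in %d launches\n",
               threshold_us, launches);
    return 0;
}
```

The index of the first slow launch gives an estimate of how many kernel launches were accepted before the call had to wait. Mixing in cudaMemcpyAsync calls would need a separate run, since the number may differ.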

cudaprogrammer · March 6, 2010, 7:31pm

seibert, thanks for the reply–very helpful to know the buffer overflow characteristics.

seibert or other forum members,

Just looking for clarification on the last part of my question. If you do not explicitly specify a stream, will the ‘async’ commands execute sequentially as part of the ‘default’ stream?

Thanks…
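For reference, the default-stream variant of the pattern being asked about can be sketched as follows. This assumes that a kernel launched without a stream argument and a cudaMemcpyAsync given stream 0 both land in the same default stream, and relies on the rule that operations in one stream execute in issue order. The kernel bodies and buffer names are invented for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for kernel1: writes a known value.
__global__ void kernel1(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = 2.0f;
}

// Hypothetical stand-in for kernel2: scales by data delivered by the copy.
__global__ void kernel2(float *buf, const float *scale, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= scale[i];
}

int main() {
    const int n = 1 << 16;
    const size_t bytes = n * sizeof(float);

    float *h_scale, *h_out;
    cudaMallocHost(&h_scale, bytes);   // pinned, so the copy can be asynchronous
    cudaMallocHost(&h_out, bytes);
    for (int i = 0; i < n; ++i) h_scale[i] = 3.0f;

    float *d_buf, *d_scale;
    cudaMalloc(&d_buf, bytes);
    cudaMalloc(&d_scale, bytes);

    const int block = 256;
    const int grid = (n + block - 1) / block;

    // No stream argument on the launches: they go to the default stream.
    kernel1<<<grid, block>>>(d_buf, n);
    // Stream argument 0 also names the default stream.
    cudaMemcpyAsync(d_scale, h_scale, bytes, cudaMemcpyHostToDevice, 0);
    kernel2<<<grid, block>>>(d_buf, d_scale, n);
    cudaMemcpyAsync(h_out, d_buf, bytes, cudaMemcpyDeviceToHost, 0);

    // Wait once, after all work has been queued.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Every element should be 2 * 3 = 6 if the operations ran in issue order.
    int errors = 0;
    for (int i = 0; i < n; ++i) errors += (h_out[i] != 6.0f);
    printf("mismatches: %d\n", errors);

    cudaFree(d_buf);
    cudaFree(d_scale);
    cudaFreeHost(h_scale);
    cudaFreeHost(h_out);
    return errors != 0;
}
```

An explicit stream is only needed when some of the work should be allowed to overlap with other work, which this sketch does not attempt.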
