simple asynchronous memcpy and kernel execution question

Forgive me if this question has already been answered, but I searched the posts and did not find anything. If you have the general pseudo code of


Will the device serialize these calls? I know kernel calls are sequential (at least until Fermi), but in the above example, will the memcpy complete before kernel2 executes? I have many of these operations and I don’t want the host to waste time waiting for blocking calls to return if I can help it. If the above is serialized, is there a practical limit to the number of calls one could make before filling up the queue on the device? Is that documented as part of the device, or is it a driver issue that one could characterize?

Thanks in advance for your help.


If you specify a stream in the kernel1 and kernel2 calls, and the async memcpy uses the same stream, the calls will be serialized. Unfortunately, as far as I know, this is not clearly specified in the documentation.

Now consider the case where you do not give a stream when launching the kernels. I don’t know what the behaviour would be (it does not look safe). kernel1 and kernel2 would be serialized (as they both use the default stream, 0), but I’m not sure we know anything about kernel2 being serialized after the memcpy.

Hope that helps,

As long as all operations are on the same stream, each operation has to finish before the next one starts.
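
To make the same-stream case concrete, here is a minimal sketch. The kernel bodies, buffer names, and the `same_stream_demo` helper are made up for illustration (the original pseudo code is not reproduced here); only the CUDA runtime calls are real. kernel1, an async host-to-device copy, and kernel2 are all issued on one stream, and kernel2 reads what both earlier operations wrote.

```cuda
#include <cuda_runtime.h>

// Bail out on the first CUDA error (no cleanup, to keep the sketch short).
#define CHECK(call) do { if ((call) != cudaSuccess) return false; } while (0)

// Hypothetical stand-ins for kernel1 and kernel2 from the question.
__global__ void kernel1(int *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = 1;
}

__global__ void kernel2(const int *a, const int *b, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

// Returns true if kernel2 saw the results of both kernel1 and the async copy.
bool same_stream_demo() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(int);
    int *h_b = nullptr, *h_out = nullptr;
    int *d_a = nullptr, *d_b = nullptr, *d_out = nullptr;
    cudaStream_t stream;

    // Pinned host memory, so the copies can actually run asynchronously.
    CHECK(cudaMallocHost((void **)&h_b, bytes));
    CHECK(cudaMallocHost((void **)&h_out, bytes));
    CHECK(cudaMalloc((void **)&d_a, bytes));
    CHECK(cudaMalloc((void **)&d_b, bytes));
    CHECK(cudaMalloc((void **)&d_out, bytes));
    CHECK(cudaStreamCreate(&stream));

    for (int i = 0; i < n; ++i) { h_b[i] = 41; h_out[i] = 0; }

    const int block = 256;
    const int grid = (n + block - 1) / block;

    // All four operations go to the same stream; none of the calls
    // waits for the GPU to finish before returning to the host.
    kernel1<<<grid, block, 0, stream>>>(d_a, n);
    CHECK(cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, stream));
    kernel2<<<grid, block, 0, stream>>>(d_a, d_b, d_out, n);
    CHECK(cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream));

    // The host only waits here, once, for everything queued on the stream.
    CHECK(cudaStreamSynchronize(stream));
    CHECK(cudaGetLastError());

    bool ok = true;
    for (int i = 0; i < n; ++i) {
        if (h_out[i] != 42) { ok = false; break; }
    }

    cudaStreamDestroy(stream);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);
    cudaFreeHost(h_b); cudaFreeHost(h_out);
    return ok;
}
```

Calling `same_stream_demo()` from `main` and checking the return value is enough to try it. If the stream did not order the operations, kernel2 could read d_b before the copy landed and the check would fail.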

I know that people have experimentally found the queue size to be between 24 and 32 (depending on device, I think) for just kernel calls. I don’t know if that same number holds when asynchronous memory copies are also queued up.

Thanks for the reply, guys. A follow-up question or two…

I have read that not specifying a stream uses the ‘default’ stream. If this is the case, would an actual stream need to be specified in this scenario (since all of these commands would use the default stream)?

Also, if the device buffer overflows with async commands, what is the system response? If the GPU being used is also the display card, does the system appear to lock up? If not, does the card just become non-responsive or give incorrect kernel results from dropping commands (that overflowed from the buffer)?

Thanks again.

It’s not nearly that catastrophic. If you queue up more commands than the buffer can hold, then the calls wait until they can be queued up before returning.
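
If you want to see where the launch call starts waiting on your own setup, something like the sketch below should work. It is an experiment, not a documented way to query the queue depth: the `spin` kernel, the `first_blocking_launch` helper, and the threshold are all made up for illustration. It queues slow kernels back to back and reports the first launch whose host-side call took noticeably long, which is presumably the point where the queue was full.

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical helper kernel: busy-waits on the device for roughly
// `cycles` clock ticks so that queued launches pile up behind it.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Launches up to max_launches spin kernels on the default stream and
// returns the index of the first launch whose host-side call took longer
// than threshold_ms, or -1 if none did.
int first_blocking_launch(int max_launches, long long spin_cycles,
                          double threshold_ms) {
    // Warm-up launch so context creation is not included in the timings.
    spin<<<1, 1>>>(1);
    cudaDeviceSynchronize();

    int first = -1;
    for (int i = 0; i < max_launches; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        spin<<<1, 1>>>(spin_cycles);
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (ms > threshold_ms) { first = i; break; }
    }

    // Drain whatever is still queued before returning.
    cudaDeviceSynchronize();
    return first;
}
```

The number you get will depend on the device, the driver, and the OS driver model, and a slow launch can have other causes, so treat the result as a rough characterization only. Make `spin_cycles` large enough that each kernel runs much longer than a launch call normally takes.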

seibert, thanks for the reply. It is very helpful to know the buffer overflow characteristics.

seibert or other forum members,

Just looking for clarification on the last part of my question. If you do not explicitly specify a stream, will the ‘async’ commands execute sequentially as part of the ‘default’ stream?
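
For reference, this is the kind of minimal check I have in mind, assuming the legacy default stream behaviour. The `add_one` kernel and the `default_stream_demo` helper are made up for illustration. The kernel is launched without a stream argument and the async copies are given stream 0, so everything lands on the default stream.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel used only for this check.
__global__ void add_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Returns true if the kernel ran after the first copy and before the second.
bool default_stream_demo() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(int);
    int *h_data = nullptr, *d_data = nullptr;

    // Pinned host memory, so the copies can actually run asynchronously.
    if (cudaMallocHost((void **)&h_data, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void **)&d_data, bytes) != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) h_data[i] = 41;

    // Stream argument 0 on the copies, no stream argument on the launch:
    // all three operations are issued to the default stream.
    cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, 0);
    add_one<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost, 0);

    // Wait once for everything queued so far.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) {
        if (h_data[i] != 42) ok = false;
    }

    cudaFree(d_data);
    cudaFreeHost(h_data);
    return ok;
}
```

If the three operations were not executed in issue order, some elements would come back as 41 or as garbage and the function would return false.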