0.9 asynchronous kernel question

One piece of good news about 0.9 is that kernel launches are asynchronous. However, I have a question about this. Please see the following code:

float *o_data, *i_data;

// Initialize i_data, which is the input data for the kernel function

kernel0<<<grid, block>>>(o_data, i_data);

cudaError_t err0 = cudaGetLastError();

cudaMemcpy(i_data, o_data, count, cudaMemcpyDeviceToDevice);

kernel1<<<grid, block>>>(o_data, i_data);

cudaError_t err1 = cudaGetLastError();

My questions:

  1. Is “cudaMemcpy” called after “kernel0” has finished?

  2. Is “kernel1” called after “cudaMemcpy” has finished?

It seems that “cudaGetLastError” does not wait for the previous call to finish.

I hope the commands execute sequentially on the device, even though the host and device run asynchronously with respect to each other.

  1) and 2): Yes, this will work fine.

cudaGetLastError doesn’t wait for kernel0 to finish, though, so you may see an error propagated from a different API call.

You need to synchronize with cudaThreadSynchronize() to make sure you get an error propagated from the correct function call.
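For example, a minimal sketch of that pattern (the kernel body, grid size, and buffer size here are illustrative stand-ins, not from the snippet above):

#include <cstdio>

// Hypothetical kernel standing in for kernel0.
__global__ void kernel0(float *o_data, const float *i_data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    o_data[i] = i_data[i] * 2.0f;
}

int main()
{
    float *o_data, *i_data;
    cudaMalloc((void **)&o_data, 256 * sizeof(float));
    cudaMalloc((void **)&i_data, 256 * sizeof(float));

    kernel0<<<2, 128>>>(o_data, i_data);

    // Catches only errors detected at launch time (e.g. a bad configuration).
    cudaError_t launchErr = cudaGetLastError();
    if (launchErr != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(launchErr));

    // Blocks until the kernel has actually finished; errors that occur
    // *during* execution are only reported after this synchronization point.
    cudaError_t execErr = cudaThreadSynchronize();
    if (execErr != cudaSuccess)
        printf("execution failed: %s\n", cudaGetErrorString(execErr));

    cudaFree(o_data);
    cudaFree(i_data);
    return 0;
}

Without the cudaThreadSynchronize() call, an error from kernel0 may surface at whatever later API call happens to synchronize.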


Thank you very much for your reply.

I have a question about asynchronous kernel execution too; this is not clearly stated in the Programming Guide. Is it possible for a block from one kernel to run before the execution of a previous kernel is complete?

For instance:

kernel0<<<gridsize0,blocksize0>>>( .... );

kernel1<<<gridsize1,blocksize1>>>( .... );

Is it possible for a block from kernel1 to complete before some block from kernel0? Is the execution of different kernels sequential?

BTW: In the programming guide, a subsection says “The only functions from the runtime that are not asynchronous are the functions that perform memory copies between the host and the device, the functions that initializes and terminates interoperability with a OpenGL or Direct3D, and the functions that register, unregister, map, and unmap OpenGL buffer objects or a Direct3D vertex buffer”. So, if cudaMemcpyDeviceToDevice forces synchronization, shouldn’t it be added to that paragraph?

Yes, the hardware ensures that kernel0 is completely done before kernel1 begins executing.

Intra-device memory copies should be asynchronous. They are performed in the same order that the API calls appear in the thread owning the CUDA context.
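A minimal sketch of relying on that ordering (the kernel, grid/block configuration, and sizes are hypothetical): a device-to-device copy issued between two launches runs after the first kernel completes and before the next one starts, with no host-side synchronization in between:

// Hypothetical iteration kernel: reads in, writes out.
__global__ void step(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] + 1.0f;
}

float *a, *b;
const int n = 256;
cudaMalloc((void **)&a, n * sizeof(float));
cudaMalloc((void **)&b, n * sizeof(float));

for (int iter = 0; iter < 10; ++iter) {
    step<<<2, 128>>>(b, a);                // queued first
    cudaMemcpy(a, b, n * sizeof(float),    // runs only after step finishes
               cudaMemcpyDeviceToDevice);
}

// Synchronize only when the host actually needs the results.
cudaThreadSynchronize();

All of these calls are queued from the host in order, and the device executes them in that order, so the copy never reads data that step is still writing.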

Ah, can I conclude from that, that the D2D copy is actually implemented as a small kernel? That would explain what people have seen here and I was actually wrong to assume there is a memory controller that handles it.

The D2D memcpy does not show up in the profiler log though :(


There’s a memory controller in there somewhere :) but if a device<->device memcpy could occur in parallel with launch processing, we would have the same synchronization problems as with host<->device copies. Without additional synchronization, the consumer doesn’t know when to start processing the data and the producer doesn’t know when it can play with its data again. In the case of device<->device copies, the ordering is enforced by the hardware so we don’t have to synchronize with the host.

We are planning to address these limitations in the future.

Yeah, please do. It would be very useful to be able to do H2D, D2H and D2D asynchronously or (even better) concurrently with a kernel running. I do accept that it would then be my responsibility to make sure the transfers don’t mess with current kernel memory accesses. I would accept mem alloc/dealloc being blocking, so kernel and copy resources can be validated properly in async processing.