One piece of good news in 0.9 is that kernel launches are asynchronous. However, I have a question about this. Please see the following code:
    float *o_data, *i_data;
    // Initialize i_data, which is the input data for the kernel function
    kernel0<<<grid, block>>>(o_data, i_data);
    cudaError_t err0 = cudaGetLastError();
    // count is the size of the buffer in bytes
    cudaMemcpy(i_data, o_data, count, cudaMemcpyDeviceToDevice);
    kernel1<<<grid, block>>>(o_data, i_data);
    cudaError_t err1 = cudaGetLastError();
Is “cudaMemcpy” executed only after “kernel0” has finished?
Is “kernel1” executed only after “cudaMemcpy” has finished?
It seems that “cudaGetLastError” does not wait for the previous call to complete.
I hope the commands on the device are executed sequentially, even though host commands and device commands are asynchronous with respect to each other.
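
To make the question concrete, here is a minimal sketch of how I would try to check the ordering myself. It rests on assumptions that are not part of my real code: the kernel addOne and the sizes are hypothetical stand-ins, and the blocking call is written as cudaDeviceSynchronize, the name used by current runtimes, which may not match what 0.9 provides. If the device really runs the three operations in the order they were issued, the value read back should be 2.0.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for kernel0/kernel1: out[i] = in[i] + 1.
__global__ void addOne(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}

int main() {
    const int n = 1024;                       // illustrative size
    const size_t bytes = n * sizeof(float);

    float *i_data = nullptr, *o_data = nullptr;
    cudaMalloc(&i_data, bytes);
    cudaMalloc(&o_data, bytes);
    cudaMemset(i_data, 0, bytes);             // all-zero bits are 0.0f

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    // Same sequence as in the question, all issued to the default stream.
    addOne<<<grid, block>>>(o_data, i_data, n);
    cudaError_t launchErr = cudaGetLastError();   // reports launch errors, does not wait
    cudaMemcpy(i_data, o_data, bytes, cudaMemcpyDeviceToDevice);
    addOne<<<grid, block>>>(o_data, i_data, n);

    // Block the host until all previously issued device work has finished,
    // so that errors raised during execution become visible here.
    cudaError_t syncErr = cudaDeviceSynchronize();

    float first = 0.0f;
    cudaMemcpy(&first, o_data, sizeof(float), cudaMemcpyDeviceToHost);

    printf("launch: %s\n", cudaGetErrorString(launchErr));
    printf("sync:   %s\n", cudaGetErrorString(syncErr));
    printf("o_data[0] = %f\n", first);

    cudaFree(i_data);
    cudaFree(o_data);
    return 0;
}
```

Checking the error code only after the blocking call is the part I am least sure about, since “cudaGetLastError” directly after a launch appears to return before the kernel has run.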