cudaGetLastError() for asynchronous calls

I have a typical for loop that asynchronously copies chunks of data to the device and launches (also asynchronously) kernels that process those chunks.

My question is: how do I check that the kernels executed successfully?

For the synchronous case we call:

cudaDeviceSynchronize(); cudaGetLastError();

But is cudaGetLastError() the right choice for the async case? Currently, I do the following:

for( size_t si = 0; si < streams_num; si++ ) {
    cudaMemcpyAsync(..., streams[si]);
    kernel<<<....,streams[si]>>>();
    checkCudaErrors(cudaStreamSynchronize(streams[si]));
    checkCudaErrors(cudaGetLastError());
}

checkCudaErrors is an auxiliary macro that handles the return code.

Google “proper cuda error checking” and take the first hit.

It explains how to handle both the API call case and the kernel call case.
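
For the loop in the question, the two cases could look roughly like the sketch below. It is only an illustration under assumptions: CHECK_CUDA is a hypothetical stand-in for checkCudaErrors, and scale_chunk, process_chunks and the buffer layout are made up for the example. The idea is that every runtime API call goes through the macro, cudaGetLastError() right after the launch reports launch errors (for example a bad configuration), and errors that occur while the kernel runs come back from a later synchronizing call such as cudaStreamSynchronize().

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the question's checkCudaErrors macro:
// report the error string with file/line and abort on any failure.
#define CHECK_CUDA(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",           \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Placeholder kernel (an assumption): doubles each element of one chunk.
__global__ void scale_chunk(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Processes streams_num chunks of chunk_len floats, one chunk per stream.
// Returns the number of elements that do not hold the expected value.
int process_chunks(int streams_num, int chunk_len) {
    const size_t total = static_cast<size_t>(streams_num) * chunk_len;
    const size_t chunk_bytes = chunk_len * sizeof(float);
    std::vector<cudaStream_t> streams(streams_num);
    float *h_data = nullptr, *d_data = nullptr;

    // Pinned host memory, so the async copies can really be asynchronous.
    CHECK_CUDA(cudaMallocHost(&h_data, total * sizeof(float)));
    CHECK_CUDA(cudaMalloc(&d_data, total * sizeof(float)));
    for (size_t i = 0; i < total; i++) h_data[i] = 1.0f;

    const int threads = 256;
    const int blocks = (chunk_len + threads - 1) / threads;

    for (int si = 0; si < streams_num; si++) {
        CHECK_CUDA(cudaStreamCreate(&streams[si]));
        float *h = h_data + static_cast<size_t>(si) * chunk_len;
        float *d = d_data + static_cast<size_t>(si) * chunk_len;

        // API call case: check the return code of every call.
        CHECK_CUDA(cudaMemcpyAsync(d, h, chunk_bytes,
                                   cudaMemcpyHostToDevice, streams[si]));

        // Kernel call case, part 1: a launch returns nothing, so ask the
        // runtime right away whether the launch itself was rejected.
        scale_chunk<<<blocks, threads, 0, streams[si]>>>(d, chunk_len);
        CHECK_CUDA(cudaGetLastError());

        CHECK_CUDA(cudaMemcpyAsync(h, d, chunk_bytes,
                                   cudaMemcpyDeviceToHost, streams[si]));
    }

    // Kernel call case, part 2: errors raised while the queued work runs
    // are returned by the synchronizing call.
    for (int si = 0; si < streams_num; si++) {
        CHECK_CUDA(cudaStreamSynchronize(streams[si]));
        CHECK_CUDA(cudaStreamDestroy(streams[si]));
    }

    int mismatches = 0;
    for (size_t i = 0; i < total; i++)
        if (h_data[i] != 2.0f) mismatches++;

    CHECK_CUDA(cudaFree(d_data));
    CHECK_CUDA(cudaFreeHost(h_data));
    return mismatches;
}
```

Calling process_chunks(4, 1 << 10) from main() exercises four streams. The synchronization is placed in a second loop so that work queued on different streams can overlap; synchronizing inside the first loop, as in the question, checks the same errors but makes the host wait for each stream before issuing work to the next one.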