I have a typical for loop that asynchronously copies chunks of data to the device and launches kernels (also asynchronously) to process those chunks.
My question is: how do I check that the kernels executed successfully?
For the synchronous case we call:
cudaDeviceSynchronize(); cudaGetLastError();
But is cudaGetLastError() the proper choice for the async case? Currently, I do the following:
for( size_t si = 0; si < streams_num; si++ ) {
    cudaMemcpyAsync(..., streams[si]);
    kernel<<<....,streams[si]>>>();
    checkCudaErrors(cudaStreamSynchronize(streams[si]));
    checkCudaErrors(cudaGetLastError());
}
checkCudaErrors is an auxiliary macro that handles the return code.
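For context, here is a minimal sketch of what a macro like checkCudaErrors might look like. This is a simplified, hypothetical version written for illustration, not necessarily the exact definition I use: it assumes the macro simply prints the error string with the file and line, then aborts the program.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical, simplified stand-in for checkCudaErrors (an assumption,
// the real macro may differ). It evaluates a CUDA runtime call once,
// and if the result is not cudaSuccess it reports where the error was
// detected and terminates the program.
#define checkCudaErrors(call)                                            \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",                 \
                    __FILE__, __LINE__, cudaGetErrorString(err_));       \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    } while (0)
```

With a definition along these lines, checkCudaErrors(cudaStreamSynchronize(streams[si])) would stop the program on the first non-success code returned by the synchronization call.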