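When using streams, is the overlap of memcpy and kernel execution the same whether the API calls are issued serially for each stream (copy to device, kernel, copy from device, all inside one loop) or grouped by operation: all cudaMemcpyAsync host-to-device calls first, then all kernel launches, then all cudaMemcpyAsync device-to-host calls last? Both orders are shown in the example code.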
Example, serial calls:
for (i=0; i<nStreams; i++) {
    offset = i*N/nStreams;
    cudaMemcpyAsync(a_d+offset, a_h+offset, size, h_to_d, stream[i]);
    kernel<<<N/(nThreads*nStreams), nThreads, 0, stream[i]>>>(a_d+offset);
    cudaMemcpyAsync(a_h+offset, a_d+offset, size, d_to_h, stream[i]);
}
Versus example, overlapped calls:
for (i=0; i<nStreams; i++) {
    offset = i*N/nStreams;
    cudaMemcpyAsync(a_d+offset, a_h+offset, size, h_to_d, stream[i]);
}
for (i=0; i<nStreams; i++) {
    offset = i*N/nStreams;
    kernel<<<N/(nThreads*nStreams), nThreads, 0, stream[i]>>>(a_d+offset);
}
for (i=0; i<nStreams; i++) {
    offset = i*N/nStreams;
    cudaMemcpyAsync(a_h+offset, a_d+offset, size, d_to_h, stream[i]);
}
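For anyone who wants to compare the two orders on their own hardware, here is a rough timing sketch rather than a definitive benchmark. It assumes a float array, a stand-in kernel body, a problem size of 1 << 22, four streams, and pinned host memory from cudaMallocHost (async copies generally need pinned memory to overlap with kernels). The sizes and the kernel body are my own placeholders, not part of the code above, and error checking is left out for brevity.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel in the question.
__global__ void kernel(float *a) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx] = a[idx] * 2.0f + 1.0f;
}

int main() {
    const int N = 1 << 22;       // assumed size, divisible by nStreams * nThreads
    const int nStreams = 4;      // assumed stream count
    const int nThreads = 256;    // assumed block size
    const int chunk = N / nStreams;
    const size_t size = chunk * sizeof(float);

    float *a_h = nullptr, *a_d = nullptr;
    cudaMallocHost((void **)&a_h, N * sizeof(float));  // pinned host memory
    cudaMalloc((void **)&a_d, N * sizeof(float));
    for (int i = 0; i < N; i++) a_h[i] = 1.0f;

    cudaStream_t stream[nStreams];
    for (int i = 0; i < nStreams; i++) cudaStreamCreate(&stream[i]);

    // Warm-up launch so one-time startup cost is not charged to the first order.
    cudaMemset(a_d, 0, N * sizeof(float));
    kernel<<<chunk / nThreads, nThreads>>>(a_d);

    for (int order = 0; order < 2; order++) {
        cudaDeviceSynchronize();  // make sure the device is idle before timing
        auto t0 = std::chrono::steady_clock::now();

        if (order == 0) {
            // Serial order: copy in, kernel, copy out, one stream at a time.
            for (int i = 0; i < nStreams; i++) {
                int offset = i * chunk;
                cudaMemcpyAsync(a_d + offset, a_h + offset, size,
                                cudaMemcpyHostToDevice, stream[i]);
                kernel<<<chunk / nThreads, nThreads, 0, stream[i]>>>(a_d + offset);
                cudaMemcpyAsync(a_h + offset, a_d + offset, size,
                                cudaMemcpyDeviceToHost, stream[i]);
            }
        } else {
            // Grouped order: all copies in, then all kernels, then all copies out.
            for (int i = 0; i < nStreams; i++) {
                int offset = i * chunk;
                cudaMemcpyAsync(a_d + offset, a_h + offset, size,
                                cudaMemcpyHostToDevice, stream[i]);
            }
            for (int i = 0; i < nStreams; i++) {
                int offset = i * chunk;
                kernel<<<chunk / nThreads, nThreads, 0, stream[i]>>>(a_d + offset);
            }
            for (int i = 0; i < nStreams; i++) {
                int offset = i * chunk;
                cudaMemcpyAsync(a_h + offset, a_d + offset, size,
                                cudaMemcpyDeviceToHost, stream[i]);
            }
        }

        cudaDeviceSynchronize();  // wait for every stream to finish
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        printf("%s order: %.3f ms\n", order == 0 ? "serial" : "grouped", ms);
    }

    for (int i = 0; i < nStreams; i++) cudaStreamDestroy(stream[i]);
    cudaFree(a_d);
    cudaFreeHost(a_h);
    return 0;
}
```

Host-side timing with cudaDeviceSynchronize is used here so the measurement does not depend on default-stream semantics. Any difference between the two orders is likely to vary with the device, for example with how many copy engines it has, so a profiler timeline is a better way to confirm what actually overlaps.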
Thanks for the help.