CUDA (newbie question): when using streams, does the order of the async calls make a difference?

When using streams, is the overlap of memcpy and kernel execution the same whether the API calls are issued stream by stream (copy in, kernel, copy out for each stream in turn) or grouped by operation (all host-to-device cudaMemcpyAsync calls first, then all kernel launches, then all device-to-host cudaMemcpyAsync calls last), as shown in the example code?

Example, serial calls:

for (i=0; i<nStreams; i++) {
    offset = i*N/nStreams;
    cudaMemcpyAsync(a_d+offset, a_h+offset, size, h_to_d, stream[i]);
    kernel<<<N/(nThreads*nStreams), nThreads, 0, stream[i]>>>(a_d+offset);
    cudaMemcpyAsync(a_h+offset, a_d+offset, size, d_to_h, stream[i]);
}

Versus example, overlapped calls (grouped by operation):

for (i=0; i<nStreams; i++) {
    offset = i*N/nStreams;
    cudaMemcpyAsync(a_d+offset, a_h+offset, size, h_to_d, stream[i]);
}

for (i=0; i<nStreams; i++) {
    offset = i*N/nStreams;
    kernel<<<N/(nThreads*nStreams), nThreads, 0, stream[i]>>>(a_d+offset);
}

for (i=0; i<nStreams; i++) {
    offset = i*N/nStreams;
    cudaMemcpyAsync(a_h+offset, a_d+offset, size, d_to_h, stream[i]);
}

Thanks for the help.

Yes, the order makes a difference. The rule for best performance is to launch breadth-first: issue the first operation in every stream before issuing a second operation in any stream. That is the pattern in your second example.
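
For reference, here is a minimal sketch of the breadth-first pattern as a complete function. The kernel body, the chunk sizes, and the name runBreadthFirst are my own placeholders, not taken from your code. One assumption worth checking in your version: the host buffer is allocated with cudaMallocHost, because cudaMemcpyAsync can only overlap with kernel execution when the host memory is page-locked.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the question's kernel: adds 1 to each element.
__global__ void kernel(float *a)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx] += 1.0f;
}

// Issues copy-in, kernel, and copy-out breadth-first across nStreams streams.
// Returns the number of wrong elements (0 on success), or -1 on a CUDA error.
int runBreadthFirst(int nStreams)
{
    const int nThreads = 256;
    const int blocksPerStream = 64;                  // illustrative size
    const int chunk = nThreads * blocksPerStream;    // elements per stream
    const int N = chunk * nStreams;
    const size_t size = chunk * sizeof(float);       // bytes per stream

    float *a_h = nullptr, *a_d = nullptr;
    // Page-locked host memory, so the async copies can overlap with kernels.
    if (cudaMallocHost((void **)&a_h, N * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&a_d, N * sizeof(float)) != cudaSuccess) {
        cudaFreeHost(a_h);
        return -1;
    }
    for (int i = 0; i < N; i++) a_h[i] = 0.0f;

    std::vector<cudaStream_t> stream(nStreams);
    for (int i = 0; i < nStreams; i++) cudaStreamCreate(&stream[i]);

    // Pass 1: queue the host-to-device copy in every stream.
    for (int i = 0; i < nStreams; i++) {
        int offset = i * chunk;
        cudaMemcpyAsync(a_d + offset, a_h + offset, size,
                        cudaMemcpyHostToDevice, stream[i]);
    }
    // Pass 2: queue the kernel in every stream.
    for (int i = 0; i < nStreams; i++) {
        int offset = i * chunk;
        kernel<<<blocksPerStream, nThreads, 0, stream[i]>>>(a_d + offset);
    }
    // Pass 3: queue the device-to-host copy in every stream.
    for (int i = 0; i < nStreams; i++) {
        int offset = i * chunk;
        cudaMemcpyAsync(a_h + offset, a_d + offset, size,
                        cudaMemcpyDeviceToHost, stream[i]);
    }

    // Wait for all streams, then check that every element was incremented once.
    cudaError_t err = cudaDeviceSynchronize();
    int wrong = 0;
    if (err == cudaSuccess) {
        for (int i = 0; i < N; i++) {
            if (a_h[i] != 1.0f) wrong++;
        }
    }

    for (int i = 0; i < nStreams; i++) cudaStreamDestroy(stream[i]);
    cudaFree(a_d);
    cudaFreeHost(a_h);
    return (err == cudaSuccess) ? wrong : -1;
}
```

Calling runBreadthFirst(4) from main should return 0 if everything ran correctly. Within each stream the operations still execute in issue order, so both orderings give the same results; only the amount of overlap can differ. To see what you actually get on your card, time both versions or look at the timeline in a profiler.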