Help with CUDA streams

Hi,
I am new to using CUDA streams. I searched the forum but did not find a suitable answer.
Are the following two sections of code equivalent or different, and if different, why? When I run them, they give different results.

Code 1:
for(int i = 0; i < num_streams; i++)
    cudaMemcpyAsync(d_a + i * n / num_streams, a + i * n / num_streams, nbytes / num_streams, cudaMemcpyHostToDevice, gpu_streams[i]);

for(int i = 0; i < num_streams; i++)
    init_array<<<blocks, threads, 0, gpu_streams[i]>>>(d_a + i * n / num_streams, d_c, 1);

for(int i = 0; i < num_streams; i++)
    cudaMemcpyAsync(a + i * n / num_streams, d_a + i * n / num_streams, nbytes / num_streams, cudaMemcpyDeviceToHost, gpu_streams[i]);

Code 2:
for(int i = 0; i < num_streams; i++){
    cudaMemcpyAsync(d_a + i * n / num_streams, a + i * n / num_streams, nbytes / num_streams, cudaMemcpyHostToDevice, gpu_streams[i]);
    init_array<<<blocks, threads, 0, gpu_streams[i]>>>(d_a + i * n / num_streams, d_c, 1);
    cudaMemcpyAsync(a + i * n / num_streams, d_a + i * n / num_streams, nbytes / num_streams, cudaMemcpyDeviceToHost, gpu_streams[i]);
}

Thanks.

They will probably get scheduled somewhat differently in terms of overlapping copies with kernel execution (in my experience the first version tends to get better overlap).

My guess is that a copy and a kernel overlap somewhere, so the results depend on the exact scheduling, i.e. on what happens to be in memory at each point (impossible to tell from what you have posted). That is, in the first version the first kernel has a better chance of already being underway when the second memcpy is issued, and so on.
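Note that within a single stream the copy-kernel-copy chain is ordered either way, so per-chunk results should be identical in both versions *if* the host waits before reading them back. A common cause of "different results" is reading `a` while the device-to-host copies are still in flight, or using pageable host memory so `cudaMemcpyAsync` falls back to synchronous behavior. Below is a minimal sketch of the depth-first version (your Code 2) with those two points fixed. It assumes `float` data, `n` divisible by `num_streams`, and a simplified `init_array` (your real kernel takes a `d_c` argument that I have omitted), so it is an illustration of the pattern rather than your exact program:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the original init_array (the real d_c parameter is omitted here).
__global__ void init_array(float *a, int chunk)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < chunk)
        a[idx] = 1.0f;
}

int main()
{
    const int num_streams = 4;
    const int n = 1 << 20;                 // assumed divisible by num_streams
    const int chunk = n / num_streams;
    const size_t nbytes = n * sizeof(float);

    float *a, *d_a;
    cudaMallocHost(&a, nbytes);            // pinned host memory: required for
                                           // cudaMemcpyAsync to actually be async
    cudaMalloc(&d_a, nbytes);

    cudaStream_t gpu_streams[num_streams];
    for (int i = 0; i < num_streams; i++)
        cudaStreamCreate(&gpu_streams[i]);

    dim3 threads(256);
    dim3 blocks((chunk + threads.x - 1) / threads.x);

    // Depth-first issue order: each stream's copy -> kernel -> copy chain
    // is serialized within that stream, so the per-chunk result is well
    // defined regardless of how the streams interleave with each other.
    for (int i = 0; i < num_streams; i++) {
        float *dst = d_a + i * chunk;
        float *src = a + i * chunk;
        cudaMemcpyAsync(dst, src, nbytes / num_streams,
                        cudaMemcpyHostToDevice, gpu_streams[i]);
        init_array<<<blocks, threads, 0, gpu_streams[i]>>>(dst, chunk);
        cudaMemcpyAsync(src, dst, nbytes / num_streams,
                        cudaMemcpyDeviceToHost, gpu_streams[i]);
    }

    // Without this barrier the host may read a[] while the device-to-host
    // copies are still in flight, and the results look nondeterministic.
    cudaDeviceSynchronize();

    printf("a[0] = %f, a[n-1] = %f\n", a[0], a[n - 1]);

    for (int i = 0; i < num_streams; i++)
        cudaStreamDestroy(gpu_streams[i]);
    cudaFreeHost(a);
    cudaFree(d_a);
    return 0;
}
```

If both versions are synchronized like this and still disagree, the next thing I would check is whether `init_array`'s grid covers exactly one chunk, or whether some launch writes past its own chunk into a neighbor's region.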