About Stream control

Hi, I’m trying to use streams to overlap memory copies with kernel execution.

The programming guide gives this example:

    for (int i = 0; i < 2; ++i)
        cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
                        size, cudaMemcpyHostToDevice, stream[i]);

    for (int i = 0; i < 2; ++i)
        myKernel<<<100, 512, 0, stream[i]>>>
            (outputDevPtr + i * size, inputDevPtr + i * size, size);

    for (int i = 0; i < 2; ++i)
        cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
                        size, cudaMemcpyDeviceToHost, stream[i]);

    cudaThreadSynchronize();

In this example, will the kernel launch in stream[1] wait for the copy in stream[1] to complete?

Or do we have to do something extra to make the kernel launch wait for the preceding copy, so that all of the data has been copied before the kernel operates on it?

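AFAIK this is precisely what streams are intended to do. Operations issued to the same stream execute in the order they were issued, so the kernel in stream[i] does not start until the host-to-device copy in stream[i] has finished, and the device-to-host copy in stream[i] does not start until that kernel has finished. Only operations in different streams are allowed to overlap.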
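To check this for yourself, here is a minimal sketch of the guide's pattern that verifies the results on the host. Several things in it are my own assumptions and not from the guide: the body of myKernel (it just doubles each element), the buffer size n, the CHECK macro, and the function name runTwoStreamPipeline. It also uses cudaDeviceSynchronize(), which replaced the deprecated cudaThreadSynchronize() in later toolkits. Offsets are in elements and copy sizes are in bytes.

```cuda
#include <cuda_runtime.h>

// Assumed helper: bail out with -1 on any CUDA error (no cleanup, sketch only).
#define CHECK(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// Hypothetical stand-in for myKernel: doubles each input element.
__global__ void myKernel(float *out, const float *in, int n)
{
    // Grid-stride loop so the <<<100, 512>>> launch covers all n elements.
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < n;
         idx += gridDim.x * blockDim.x)
        out[idx] = 2.0f * in[idx];
}

// Returns the number of wrong output elements (0 if every element was
// copied in before the kernel read it), or -1 on a CUDA error.
int runTwoStreamPipeline()
{
    const int n = 1 << 20;                    // elements per stream (assumed)
    const size_t bytes = n * sizeof(float);   // bytes per stream

    float *hostPtr, *inputDevPtr, *outputDevPtr;
    // Page-locked host memory lets cudaMemcpyAsync overlap with other work.
    CHECK(cudaMallocHost((void **)&hostPtr, 2 * bytes));
    CHECK(cudaMalloc((void **)&inputDevPtr, 2 * bytes));
    CHECK(cudaMalloc((void **)&outputDevPtr, 2 * bytes));
    for (int k = 0; k < 2 * n; ++k) hostPtr[k] = 1.0f;

    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) CHECK(cudaStreamCreate(&stream[i]));

    // No explicit waits between the three loops: within stream[i] the
    // copy-in, the kernel, and the copy-out run in issue order.
    for (int i = 0; i < 2; ++i)
        CHECK(cudaMemcpyAsync(inputDevPtr + i * n, hostPtr + i * n,
                              bytes, cudaMemcpyHostToDevice, stream[i]));
    for (int i = 0; i < 2; ++i)
        myKernel<<<100, 512, 0, stream[i]>>>
            (outputDevPtr + i * n, inputDevPtr + i * n, n);
    CHECK(cudaGetLastError());
    for (int i = 0; i < 2; ++i)
        CHECK(cudaMemcpyAsync(hostPtr + i * n, outputDevPtr + i * n,
                              bytes, cudaMemcpyDeviceToHost, stream[i]));

    // Block the host until both streams have drained.
    CHECK(cudaDeviceSynchronize());

    // Every element started as 1.0f, so a correct run leaves 2.0f everywhere.
    int wrong = 0;
    for (int k = 0; k < 2 * n; ++k)
        if (hostPtr[k] != 2.0f) ++wrong;

    for (int i = 0; i < 2; ++i) CHECK(cudaStreamDestroy(stream[i]));
    CHECK(cudaFree(inputDevPtr));
    CHECK(cudaFree(outputDevPtr));
    CHECK(cudaFreeHost(hostPtr));
    return wrong;
}
```

Call runTwoStreamPipeline() from main and build with nvcc. A return value of 0 is consistent with each kernel having seen fully copied input, although a passing run is evidence and not a proof of the ordering guarantee.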