Hi, I'm trying to use streams to overlap operations in time.
The programming guide lists this example:
for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
                    size, cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < 2; ++i)
    myKernel<<<100, 512, 0, stream[i]>>>
        (outputDevPtr + i * size, inputDevPtr + i * size, size);
for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
                    size, cudaMemcpyDeviceToHost, stream[i]);
cudaThreadSynchronize();
In this example,
will the kernel launch in stream[1] wait for the copy in stream[1] to finish?
Or do we have to do something ourselves to make the kernel launch wait for the preceding copy, so that all of the data has been copied before the kernel runs?
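
Here is a minimal sketch of a test I would use to check this, based on the guide's example. It assumes that operations issued to the same stream run in the order they were issued. If that holds, each kernel sees fully copied input and the program should report 0 mismatches. The kernel addOne, the element count n, and the CHECK macro are hypothetical names made up for this test. It uses cudaDeviceSynchronize, which replaces the deprecated cudaThreadSynchronize on newer toolkits. A passing run is consistent with in-stream ordering but does not prove it on its own.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Hypothetical kernel for this test: out[i] = in[i] + 1.
__global__ void addOne(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] + 1.0f;
}

int main()
{
    const int n = 1 << 20;                    // elements handled per stream
    const size_t bytes = n * sizeof(float);   // bytes handled per stream
    float *hostPtr, *inputDevPtr, *outputDevPtr;
    cudaStream_t stream[2];

    // Page-locked host memory is needed for the copies to be asynchronous.
    CHECK(cudaMallocHost((void **)&hostPtr, 2 * bytes));
    CHECK(cudaMalloc((void **)&inputDevPtr, 2 * bytes));
    CHECK(cudaMalloc((void **)&outputDevPtr, 2 * bytes));
    for (int i = 0; i < 2; ++i)
        CHECK(cudaStreamCreate(&stream[i]));

    for (int i = 0; i < 2 * n; ++i)
        hostPtr[i] = (float)i;

    // Clear device buffers so stale data cannot make the check pass.
    CHECK(cudaMemset(inputDevPtr, 0, 2 * bytes));
    CHECK(cudaMemset(outputDevPtr, 0, 2 * bytes));

    // Same pattern as the guide: copy in, kernel, copy out, per stream.
    // Offsets are in elements (i * n); copy sizes are in bytes.
    for (int i = 0; i < 2; ++i)
        CHECK(cudaMemcpyAsync(inputDevPtr + i * n, hostPtr + i * n,
                              bytes, cudaMemcpyHostToDevice, stream[i]));
    for (int i = 0; i < 2; ++i)
        addOne<<<(n + 511) / 512, 512, 0, stream[i]>>>
            (outputDevPtr + i * n, inputDevPtr + i * n, n);
    CHECK(cudaGetLastError());
    for (int i = 0; i < 2; ++i)
        CHECK(cudaMemcpyAsync(hostPtr + i * n, outputDevPtr + i * n,
                              bytes, cudaMemcpyDeviceToHost, stream[i]));

    // Wait for all work in both streams to finish.
    CHECK(cudaDeviceSynchronize());

    // Every element should now be its original value plus one.
    int mismatches = 0;
    for (int i = 0; i < 2 * n; ++i)
        if (hostPtr[i] != (float)i + 1.0f)
            ++mismatches;
    printf("mismatches: %d\n", mismatches);

    for (int i = 0; i < 2; ++i)
        CHECK(cudaStreamDestroy(stream[i]));
    CHECK(cudaFree(inputDevPtr));
    CHECK(cudaFree(outputDevPtr));
    CHECK(cudaFreeHost(hostPtr));
    return mismatches != 0;
}
```

If explicit waiting between different streams were needed, I assume cudaStreamSynchronize(stream[i]) or events (cudaEventRecord plus cudaStreamWaitEvent) would be the tools to look at, but the test above does not use them.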