I create a stream and then launch a kernel in it. This kernel fills the data:
kernel1<<<1, 128, 0, stream>>>(output);
Now I want to copy this output data to the CPU and, at the same time, launch another kernel (kernel2) that uses output:
cudaMemcpyAsync(cpu, output, size, cudaMemcpyDeviceToHost, stream);
But kernel2 will start only after the copy to the CPU has finished, since both are issued to the same stream.
If I run either of them in a different stream, I obviously lose the guarantee that output is ready.
How can I run cudaMemcpyAsync and kernel2 concurrently while still ensuring that output is ready?
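
To make the setup concrete, here is a minimal sketch of one common way to express this kind of dependency: record an event in the first stream right after kernel1, and have a second stream wait on that event before running kernel2. The kernel bodies, the result buffer, the size of 128 ints, and the names stream2, outputReady, and run_overlap_demo are assumptions made up for illustration; only kernel1, kernel2, output, cpu, size, and stream come from the description above. Whether the copy and kernel2 actually overlap depends on the device, and the host buffer is pinned here because pageable memory can make the copy behave synchronously.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Bail out of the demo with -1 on any CUDA error.
#define CHECK(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// Hypothetical stand-in for kernel1: fills the data.
__global__ void kernel1(int *output) {
    output[threadIdx.x] = threadIdx.x;
}

// Hypothetical stand-in for kernel2: only reads output.
__global__ void kernel2(const int *output, int *result) {
    result[threadIdx.x] = output[threadIdx.x] * 2;
}

// Returns the number of wrong elements (0 on success), or -1 on a CUDA error.
int run_overlap_demo() {
    const int n = 128;
    const size_t size = n * sizeof(int);
    int *output = nullptr, *result = nullptr;
    int *cpu = nullptr, *cpuResult = nullptr;
    cudaStream_t stream, stream2;
    cudaEvent_t outputReady;

    CHECK(cudaMalloc(&output, size));
    CHECK(cudaMalloc(&result, size));
    CHECK(cudaMallocHost(&cpu, size));        // pinned host memory
    CHECK(cudaMallocHost(&cpuResult, size));
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaStreamCreate(&stream2));
    CHECK(cudaEventCreateWithFlags(&outputReady, cudaEventDisableTiming));

    // Producer: kernel1 fills output in the first stream.
    kernel1<<<1, n, 0, stream>>>(output);
    // Mark the point in the first stream at which output is complete.
    CHECK(cudaEventRecord(outputReady, stream));

    // The copy stays in the first stream, so it is ordered after kernel1.
    CHECK(cudaMemcpyAsync(cpu, output, size, cudaMemcpyDeviceToHost, stream));

    // The second stream waits for the event (not for the copy), then runs kernel2.
    CHECK(cudaStreamWaitEvent(stream2, outputReady, 0));
    kernel2<<<1, n, 0, stream2>>>(output, result);
    CHECK(cudaMemcpyAsync(cpuResult, result, size, cudaMemcpyDeviceToHost, stream2));

    CHECK(cudaStreamSynchronize(stream));
    CHECK(cudaStreamSynchronize(stream2));

    // Verify both host buffers.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (cpu[i] != i) ++mismatches;
        if (cpuResult[i] != 2 * i) ++mismatches;
    }

    cudaEventDestroy(outputReady);
    cudaStreamDestroy(stream);
    cudaStreamDestroy(stream2);
    cudaFreeHost(cpu);
    cudaFreeHost(cpuResult);
    cudaFree(output);
    cudaFree(result);
    return mismatches;
}

int main() {
    int mismatches = run_overlap_demo();
    printf("mismatches: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

This sketch is only safe because kernel2 reads output without modifying it; if kernel2 wrote to output, the copy and the kernel would race and the copy would have to finish first.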