Overlap cudaMemcpyAsync and kernel

Hi,

I create a stream:

cudaStream_t stream;
cudaStreamCreate(&stream);

Then I launch a kernel in that stream. This kernel fills the output buffer:

kernel1<<<1, 128, 0, stream>>>(output);

Now I want to copy this output data to the CPU and, at the same time, launch another kernel that uses output:

cudaMemcpyAsync(cpu, output, size, cudaMemcpyDeviceToHost, stream);
kernel2<<<1,128,0,stream>>>(output);

But kernel2 will start only after the data has been copied to the CPU, since both operations are in the same stream.
If I issue either of them in another stream, I obviously lose the guarantee that output is ready.
How can I run cudaMemcpyAsync and kernel2 in parallel while ensuring that output is ready?

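You will need to synchronize between the two streams at the appropriate points: record an event in the first stream right after kernel1 with cudaEventRecord, then have the second stream wait on that event with cudaStreamWaitEvent before you issue the copy there. There may be a relevant example among the sample apps that ship with CUDA. Have you checked?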
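
Here is a minimal sketch of that pattern, not a definitive implementation. The kernel bodies, the names computeStream, copyStream and outputReady, and the extra result buffer are all made up for illustration. It assumes that kernel2 only reads output (if it wrote to output, it would race with the copy) and that the host buffer is pinned, which is needed for the copy to be truly asynchronous.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-ins for the kernels in the question.
__global__ void kernel1(int *output) {
    output[threadIdx.x] = threadIdx.x;              // fills the data
}

// Assumption: kernel2 only reads output and writes its results elsewhere.
__global__ void kernel2(const int *output, int *result) {
    result[threadIdx.x] = 2 * output[threadIdx.x];
}

int main() {
    const int n = 128;
    const size_t size = n * sizeof(int);

    int *output, *result, *cpu;
    cudaMalloc(&output, size);
    cudaMalloc(&result, size);
    cudaMallocHost(&cpu, size);                     // pinned host memory

    cudaStream_t computeStream, copyStream;
    cudaStreamCreate(&computeStream);
    cudaStreamCreate(&copyStream);

    cudaEvent_t outputReady;
    cudaEventCreateWithFlags(&outputReady, cudaEventDisableTiming);

    // Produce the data, then mark the point where output is complete.
    kernel1<<<1, n, 0, computeStream>>>(output);
    cudaEventRecord(outputReady, computeStream);

    // The copy stream waits for kernel1 only, not for kernel2.
    cudaStreamWaitEvent(copyStream, outputReady, 0);
    cudaMemcpyAsync(cpu, output, size, cudaMemcpyDeviceToHost, copyStream);

    // Same stream as kernel1, so it is already ordered after it,
    // and it is free to overlap with the copy in copyStream.
    kernel2<<<1, n, 0, computeStream>>>(output, result);

    // Wait for the copy before touching cpu on the host.
    cudaStreamSynchronize(copyStream);
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (cpu[i] != i) ++mismatches;
    }
    printf("mismatches in copied data: %d\n", mismatches);

    cudaDeviceSynchronize();                        // let kernel2 finish
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
    }

    cudaEventDestroy(outputReady);
    cudaStreamDestroy(computeStream);
    cudaStreamDestroy(copyStream);
    cudaFreeHost(cpu);
    cudaFree(result);
    cudaFree(output);
    return (mismatches == 0 && err == cudaSuccess) ? 0 : 1;
}
```

Whether the copy and kernel2 actually overlap depends on the device and on how long each one takes; the event only guarantees the ordering, so a profiler timeline is the way to confirm the overlap.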