Overlap cudaMemcpyAsync and kernel

alphard.ayer · February 10, 2021, 7:12pm

Hi,

I create a stream:

cudaStream_t stream;
cudaStreamCreate(&stream);

Then I launch a kernel that fills the output buffer:

kernel1<<<1, 128, 0, stream>>>(output);

Now I want to copy this output data to the CPU and, at the same time, launch another kernel that uses output:

cudaMemcpyAsync(cpu, output, size, cudaMemcpyDeviceToHost, stream);
kernel2<<<1,128,0,stream>>>(output);

But kernel2 will start only after the copy to the CPU has finished, since both are issued to the same stream.
If I issue either of them to a different stream, I obviously lose the guarantee that output is ready.
How can I run cudaMemcpyAsync and kernel2 in parallel while still ensuring that output is ready?

njuffa · February 10, 2021, 8:03pm

You will need to synchronize between the two streams at the appropriate points with cudaStreamWaitEvent. There may be a relevant example among the sample apps that ship with CUDA. Have you checked?
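
For illustration, here is a minimal sketch of that event-based approach; it is one possible arrangement, not code from the CUDA samples. The kernel bodies, the second stream (copyStream), the event name (outputReady), the result buffers, and the CHECK macro are all assumptions made for this example. It also assumes kernel2 only reads output (writing to it while the copy is in flight would be a race) and that the host buffer is pinned with cudaMallocHost, since a copy into pageable host memory generally cannot overlap with kernel execution.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: report a CUDA error and bail out of the enclosing function.
#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "%s failed: %s\n", #call,                      \
                    cudaGetErrorString(err_));                             \
            return false;                                                  \
        }                                                                  \
    } while (0)

// Hypothetical stand-ins for the kernels in the question.
__global__ void kernel1(int *output) { output[threadIdx.x] = threadIdx.x; }

// Assumption: kernel2 only reads output and writes elsewhere.
__global__ void kernel2(const int *output, int *result) {
    result[threadIdx.x] = 2 * output[threadIdx.x];
}

bool overlap_copy_and_kernel() {
    const int n = 128;
    const size_t size = n * sizeof(int);

    int *output = nullptr, *result = nullptr;
    int *cpu = nullptr, *cpu_result = nullptr;
    CHECK(cudaMalloc(&output, size));
    CHECK(cudaMalloc(&result, size));
    CHECK(cudaMallocHost(&cpu, size));         // pinned host memory
    CHECK(cudaMallocHost(&cpu_result, size));  // pinned host memory

    cudaStream_t stream, copyStream;
    cudaEvent_t outputReady;
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaStreamCreate(&copyStream));
    CHECK(cudaEventCreateWithFlags(&outputReady, cudaEventDisableTiming));

    // Producer: fill output, then mark the point where it is complete.
    kernel1<<<1, n, 0, stream>>>(output);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(outputReady, stream));

    // The copy goes to the second stream, which first waits for the event.
    CHECK(cudaStreamWaitEvent(copyStream, outputReady, 0));
    CHECK(cudaMemcpyAsync(cpu, output, size, cudaMemcpyDeviceToHost, copyStream));

    // kernel2 stays in the first stream, so it is already ordered after
    // kernel1 and is free to run while the copy is in progress.
    kernel2<<<1, n, 0, stream>>>(output, result);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpyAsync(cpu_result, result, size, cudaMemcpyDeviceToHost, stream));

    // Wait for both streams before touching the host buffers.
    CHECK(cudaStreamSynchronize(copyStream));
    CHECK(cudaStreamSynchronize(stream));

    bool ok = true;
    for (int i = 0; i < n; ++i) {
        if (cpu[i] != i || cpu_result[i] != 2 * i) ok = false;
    }

    CHECK(cudaEventDestroy(outputReady));
    CHECK(cudaStreamDestroy(copyStream));
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFreeHost(cpu_result));
    CHECK(cudaFreeHost(cpu));
    CHECK(cudaFree(result));
    CHECK(cudaFree(output));
    return ok;
}

int main() {
    bool ok = overlap_copy_and_kernel();
    printf("%s\n", ok ? "results match" : "mismatch or CUDA error");
    return ok ? 0 : 1;
}
```

The event only expresses the dependency "output is complete"; whether the copy and kernel2 actually run concurrently still depends on the device and on how long each operation takes, which a profiler timeline can confirm.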
