cudaMemcpyAsync blocks and has long Runtime API duration

Hi

I have a setup where I want to execute multiple streams in parallel. To each stream I submit multiple async H2D copies, one kernel launch, and one D2H transfer. In the NVIDIA Visual Profiler I can see that the kernel starts right after the H2D copies, and the D2H copy starts right after the kernel. However, the function issuing these calls does not return until the D2H copy has finished, as if it were a synchronous memcpy.

More observations: the D2H transfer itself takes 2.9 ms, but the Runtime API row shows 266 ms for this call.

Removing the D2H copy lets the H2D transfers and kernels run in parallel.

What is the cause of this? Is there some implicit synchronization anywhere?

My function does this:

// Async H2D: stage the incoming image into the per-stream device buffer
cudaMemcpyAsync(&p_frame_buffer->d_image_buffer[num_images*image_size],
                p_image,
                p_frame_buffer->image_size,
                cudaMemcpyHostToDevice,
                p_cuda_streams[m_current_stream]);

// Kernel launch in the same stream, consuming the image buffer
process_data<<< num_blocks, num_threads, 0, p_cuda_streams[m_current_stream]>>>
            (p_frame_buffers[m_current_stream].d_image_buffer,
             d_output_buffer_structs[m_current_stream]);

// Async D2H: copy the results back to the host output buffer
// (this is the call that appears to block)
cudaMemcpyAsync(p_output_buffer->p_buffer,
                d_output_buffers[m_current_stream],
                p_output_buffer->buffer_size,
                cudaMemcpyDeviceToHost,
                p_cuda_streams[m_current_stream]);
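
For reference, this is a minimal, self-contained sketch of the overlap pattern I am trying to reproduce. It is not my actual code: the names, sizes, kernel body, and the allocation calls (cudaMallocHost for pinned host buffers, cudaMalloc for device buffers) are placeholders I chose so the snippet compiles on its own.

// Sketch: per-stream H2D -> kernel -> D2H with pinned host memory
#include <cuda_runtime.h>
#include <cstdio>

__global__ void process_data(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;   // placeholder work
}

int main()
{
    const int    num_streams = 4;
    const int    n     = 1 << 20;            // elements per stream
    const size_t bytes = n * sizeof(float);

    cudaStream_t streams[num_streams];
    float *h_in[num_streams], *h_out[num_streams];
    float *d_in[num_streams], *d_out[num_streams];

    for (int s = 0; s < num_streams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMallocHost((void **)&h_in[s],  bytes);   // pinned host memory
        cudaMallocHost((void **)&h_out[s], bytes);   // pinned host memory
        cudaMalloc((void **)&d_in[s],  bytes);
        cudaMalloc((void **)&d_out[s], bytes);
    }

    // Enqueue H2D, kernel, D2H into each stream without synchronizing
    for (int s = 0; s < num_streams; ++s) {
        cudaMemcpyAsync(d_in[s], h_in[s], bytes,
                        cudaMemcpyHostToDevice, streams[s]);
        process_data<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_in[s], d_out[s], n);
        cudaMemcpyAsync(h_out[s], d_out[s], bytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    cudaDeviceSynchronize();                 // wait for all streams to finish
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}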