cudaMemcpyAsync blocks and has long Runtime API duration

Hi

I have a setup where I want to execute multiple streams in parallel. To each stream I submit multiple async H2D copies, one kernel launch, and one D2H transfer. In the NVIDIA Visual Profiler I can see that the kernel starts right after the H2D copies, and the D2H copy starts right after the kernel. However, the function issuing these calls does not return until the D2H copy has finished, as if it were a synchronous memcpy.

More observations: the D2H transfer itself takes 2.9 ms, but the Runtime API row shows 266 ms for this call.

Removing the D2H copy lets the H2D transfers and kernels run in parallel.

What is the cause of this? Is there some implicit synchronization anywhere?

My function does this:

// Async H2D: stage the incoming image into the per-stream device buffer
cudaMemcpyAsync(&p_frame_buffer->d_image_buffer[num_images*image_size],
                p_image,
                p_frame_buffer->image_size,
                cudaMemcpyHostToDevice,
                p_cuda_streams[m_current_stream]);

// Kernel launch in the same stream, consuming the image buffer
process_data<<< num_blocks, num_threads, 0, p_cuda_streams[m_current_stream]>>>
            (p_frame_buffers[m_current_stream].d_image_buffer,
             d_output_buffer_structs[m_current_stream]);

// Async D2H: copy the results back to the host output buffer
// (this is the call that appears to block)
cudaMemcpyAsync(p_output_buffer->p_buffer,
                d_output_buffers[m_current_stream],
                p_output_buffer->buffer_size,
                cudaMemcpyDeviceToHost,
                p_cuda_streams[m_current_stream]);
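
For reference, this is a minimal, self-contained sketch of the overlap pattern I am trying to reproduce. It is not my actual code: the names, sizes, kernel body, and the allocation calls (cudaMallocHost for pinned host buffers, cudaMalloc for device buffers) are placeholders I chose so the snippet compiles on its own.

// Sketch: per-stream H2D -> kernel -> D2H with pinned host memory
#include <cuda_runtime.h>
#include <cstdio>

__global__ void process_data(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;   // placeholder work
}

int main()
{
    const int    num_streams = 4;
    const int    n     = 1 << 20;            // elements per stream
    const size_t bytes = n * sizeof(float);

    cudaStream_t streams[num_streams];
    float *h_in[num_streams], *h_out[num_streams];
    float *d_in[num_streams], *d_out[num_streams];

    for (int s = 0; s < num_streams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMallocHost((void **)&h_in[s],  bytes);   // pinned host memory
        cudaMallocHost((void **)&h_out[s], bytes);   // pinned host memory
        cudaMalloc((void **)&d_in[s],  bytes);
        cudaMalloc((void **)&d_out[s], bytes);
    }

    // Enqueue H2D, kernel, D2H into each stream without synchronizing
    for (int s = 0; s < num_streams; ++s) {
        cudaMemcpyAsync(d_in[s], h_in[s], bytes,
                        cudaMemcpyHostToDevice, streams[s]);
        process_data<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_in[s], d_out[s], n);
        cudaMemcpyAsync(h_out[s], d_out[s], bytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    cudaDeviceSynchronize();                 // wait for all streams to finish
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}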