Cuda stream stalls due to memcpyAsync --- even when memory copy performing is idle?

I’m seeing some strange stalls in my application at the point where I’m memcpyAsync.


During these 7ms, cuda is only executing the kernel and does not perform memory copies simultaneously.I’m not sure why cuda stream cannot overlap memory copy and execute kernel.

The picture shows a single host-to-device transfer.
Can you show the cuda api timeline as well, and explain where a stall happens?


the cuda api timeline
stream 127 is being two cudaMemcpyAsync(cudaMemcpyHostToDevice).
other streams are being do tensorRT inference.