Delay between cudaMemcpy and kernel launch with MPS

I’m using Nsight Systems to profile my program.

I found that when some background tasks are running alongside my program under MPS, there can be a delay of up to 20 ms between cudaMemcpy and the kernel launch.

What causes this delay, and how can I resolve it?

Abnormal case

Normal case