I’m using Nsight Systems to profile my program.
I found that when some background tasks run alongside my program with MPS enabled, there can be a delay of up to 20 ms between cudaMemcpy and the kernel launch.
What’s this delay? How can I resolve it?
[Screenshot: abnormal case]
[Screenshot: normal case]

