Why can't CUDA kernel computation overlap with CPU-to-GPU data transfer?

I tried to prefetch some data before the computation starts. Strangely, the computation and the CPU-to-GPU transfer sometimes overlap well, but at other times they execute in sequence. In the Nsight Systems results, the kernel execution time between 10829 and 10830 is around 9 ms. However, there is no dependency between the data copy and the computation.

I repeat the same kind of process several times. At another point in time, the two overlap very well, even though the code is exactly the same. What could be causing this?

Is there any change in behavior if you profile your code specifying

CUDA_MODULE_LOADING=EAGER nsys profile ...

?
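
For anyone trying to reproduce this, here is a minimal sketch of the copy/compute pattern under discussion. It is not the original poster's code: `busy_kernel`, `run_prefetch_demo`, the buffer names, and the sizes are all hypothetical stand-ins. It assumes the two usual prerequisites for overlap, namely a page-locked (pinned) host buffer and two separate non-default streams, and it adds a warm-up launch so that one-time first-launch costs fall outside the region of interest. Whether any of these is related to the behavior in the question is not established here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Bail out of the enclosing function on any CUDA error.
// (A sketch: resources are not released on the error path.)
#define CUDA_TRY(call)                                              \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s failed: %s\n", #call,               \
                    cudaGetErrorString(err_));                      \
            return 1;                                               \
        }                                                           \
    } while (0)

// Hypothetical stand-in for the real computation; it only keeps the GPU busy.
__global__ void busy_kernel(float *data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k) v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

// Launches a kernel on one stream while prefetching the next input on another.
// Returns 0 on success, 1 on any CUDA error.
int run_prefetch_demo() {
    const int n = 1 << 22;                    // assumed problem size
    const size_t bytes = n * sizeof(float);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *h_next = nullptr, *d_cur = nullptr, *d_next = nullptr;
    cudaStream_t compute_stream, copy_stream;

    // Pinned host memory: pageable memory can prevent a truly async copy.
    CUDA_TRY(cudaMallocHost((void **)&h_next, bytes));
    CUDA_TRY(cudaMalloc((void **)&d_cur, bytes));
    CUDA_TRY(cudaMalloc((void **)&d_next, bytes));
    CUDA_TRY(cudaMemset(d_cur, 0, bytes));
    for (int i = 0; i < n; ++i) h_next[i] = 1.0f;

    // Non-blocking streams do not implicitly synchronize with the default stream.
    CUDA_TRY(cudaStreamCreateWithFlags(&compute_stream, cudaStreamNonBlocking));
    CUDA_TRY(cudaStreamCreateWithFlags(&copy_stream, cudaStreamNonBlocking));

    // Warm-up launch so first-launch overhead is paid before the overlapped region.
    busy_kernel<<<blocks, threads, 0, compute_stream>>>(d_cur, n, 1);
    CUDA_TRY(cudaGetLastError());
    CUDA_TRY(cudaStreamSynchronize(compute_stream));

    // Region of interest: kernel on one stream, independent H2D copy on the other.
    busy_kernel<<<blocks, threads, 0, compute_stream>>>(d_cur, n, 2000);
    CUDA_TRY(cudaGetLastError());
    CUDA_TRY(cudaMemcpyAsync(d_next, h_next, bytes, cudaMemcpyHostToDevice,
                             copy_stream));

    CUDA_TRY(cudaStreamSynchronize(compute_stream));
    CUDA_TRY(cudaStreamSynchronize(copy_stream));

    CUDA_TRY(cudaStreamDestroy(compute_stream));
    CUDA_TRY(cudaStreamDestroy(copy_stream));
    CUDA_TRY(cudaFree(d_cur));
    CUDA_TRY(cudaFree(d_next));
    CUDA_TRY(cudaFreeHost(h_next));
    return 0;
}
```

Calling `run_prefetch_demo()` from a `main` function and profiling the resulting binary both with and without `CUDA_MODULE_LOADING=EAGER` gives two timelines to compare for the second kernel launch and the host-to-device copy.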