Why are there still several CPU-to-GPU transfers between two consecutive kernel calls in CUDA?

I have already packed the data and transferred it to the GPU. The relationship between the two kernels is that the second kernel consumes values computed by the first. There shouldn't be any data transfer between these two kernel launches, so why are there still several transfers from the CPU to the GPU?
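For context, here is a minimal CUDA Fortran sketch (all names hypothetical, not the original poster's code) of the pattern being described: one host-to-device copy up front, two chained kernel launches operating on the same device array, and one copy back at the end. In principle, no data transfer should appear between the two launches.

```fortran
! Hypothetical minimal sketch of the intended pattern:
! copy in once, chain two kernels on device data, copy out once.
module kernels_m
contains
  attributes(global) subroutine k1(a, n)
    integer, value :: n
    real, device :: a(n)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) a(i) = a(i) + 1.0   ! first kernel produces values
  end subroutine
  attributes(global) subroutine k2(a, n)
    integer, value :: n
    real, device :: a(n)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) a(i) = a(i) * 2.0   ! second kernel consumes them in place
  end subroutine
end module

program chain
  use cudafor
  use kernels_m
  integer, parameter :: n = 1024
  real :: h(n)
  real, device :: d(n)
  h = 1.0
  d = h                                 ! one host-to-device transfer
  call k1<<<(n + 255) / 256, 256>>>(d, n)
  call k2<<<(n + 255) / 256, 256>>>(d, n)  ! no transfer expected between launches
  h = d                                 ! one device-to-host transfer
end program
```

The question is why a profile of this kind of chain nonetheless shows small host-to-device copies between the two launches.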
Here is the behavior between the first and second kernel launches:


And here is the data-transfer behavior between these two kernels as measured by Nsight Systems:

Additionally, I want to mention that I have already allocated pinned memory, and memory has been allocated for all the arrays used. I have only encapsulated the global subroutine.

It's difficult to say without a reproducer. It could be the Fortran descriptors being copied, as we're discussing in your other post.
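For what it's worth, a hedged sketch of how such a descriptor copy can arise (assuming CUDA Fortran; subroutine names are hypothetical): when a global subroutine takes an assumed-shape dummy argument, the compiler builds an array descriptor (bounds, strides) on the host and copies it to the device at each launch. Nsight Systems then shows a small host-to-device transfer before the kernel even though the array data itself never moves. Passing the extent explicitly avoids the descriptor entirely.

```fortran
! Hypothetical contrast between the two argument styles.

! Assumed-shape dummy: the compiler passes an array descriptor,
! which may appear as a small H2D copy at every launch.
attributes(global) subroutine uses_descriptor(a)
  real, device :: a(:)              ! assumed shape -> descriptor copied
  integer :: i
  i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
  if (i <= size(a)) a(i) = a(i) + 1.0
end subroutine

! Explicit-shape dummy: only a device pointer and a scalar are passed,
! so no descriptor transfer is needed.
attributes(global) subroutine no_descriptor(a, n)
  integer, value :: n
  real, device :: a(n)              ! explicit shape -> plain pointer + scalar
  integer :: i
  i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
  if (i <= n) a(i) = a(i) + 1.0
end subroutine
```

Whether this is actually what the profile is showing would still need a reproducer to confirm.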

Sure, let me clarify what I meant. What I previously said — that the small transfers were caused by not encapsulating the memory allocation, data transfer, and kernel functions in a module — was incorrect. What I meant is that, without a reproducer, you might not be able to see the actual example, and for now we can only suspect that the descriptor copies are affecting the transfer time.