I have already packed the data and transferred it to the GPU. The two kernel functions are related in that the value computed by the first is used by the second. There shouldn’t be any data transfer between these two kernel functions, so why are there still several transfers from the CPU to the GPU?
This is the behavior between the first and second kernel functions.
And this is the data transfer behavior between these two kernel functions, as measured by Nsight Systems.
Additionally, I have already allocated pinned memory, and memory has been allocated for all the arrays used. I have only encapsulated the global subroutine.