How to avoid lag phase when iteratively accessing GPU memory?

I am trying to run a program on my cluster’s GPU partition (Tesla P100s). The program iteratively accesses GPU memory (we realize this is not ideal, but we lack CUDA expertise and some of our calculations must run on the CPU). It seems that every time the program accesses GPU memory, there is a huge startup lag on the card, and the non-GPU version of the program runs much faster.

What is the likely problem here?

CUDA often has a start-up lag. This can in some cases be mitigated by enabling persistence mode, if it is not already set (e.g. `nvidia-smi -pm 1`, run as root; on newer drivers the recommended mechanism is the `nvidia-persistenced` daemon). Persistence mode keeps the driver loaded even when no client is connected, avoiding repeated driver initialization.

Once a process has started and established a context on the GPU, the start-up cost should only be incurred once, not each time that process accesses the GPU.

On the other hand, if by “iteratively accessing” you mean that a CPU process starts up, does something on the GPU, and quits, and then another process starts up, accesses the GPU, and quits, then each new process will pay the CUDA start-up cost (context creation) again. In that case, I would create an executive process that remains running and issues work to the GPU on behalf of the short-lived tasks, so the new-process startup delay is paid only once.