CUDA frequently causes execution latencies of up to 30 ms. Is there a hidden CUDA thread running?

In my simple image processing pipeline, consisting of:

  • memory copy from CPU to GPU
  • CUDA kernel execution
  • memory copy from GPU to CPU

I observe regular peaks in execution time, which cause latencies in my image processing. I figured out that they come from the GPU-to-CPU memory transfer. When using page-locked CPU memory, the variation in execution time is much lower. It looks like there is an internal CUDA thread running approximately every 100 ms, which delays my execution by 5-30 ms. Is it possible to control that hidden CUDA thread?

I’m using a Jetson TX2 with Ubuntu and CUDA 10.2.

Thanks in advance.

I am having a similar problem. I’ll add my observations; maybe they’ll help.

With page-locked memory, a synchronous memory copy waits for other kernel operations to complete. So if you are using multiple threads, they might be conflicting with each other. Robert said something about it:

When I switched to pinned memory + asynchronous memory copy, I saw that the host-to-device transfer delays were gone, but the overhead had moved to the stream or device synchronization call (cudaStreamSynchronize or cudaDeviceSynchronize). I’m now trying to find out why synchronization is taking so long.
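
One thing that may help narrow this down: asynchronous copies and kernel launches return to the host right away, so the synchronize call is where the host ends up waiting for everything queued in the stream. The sketch below records CUDA events around the queued work and compares the GPU-side elapsed time with the host-side wait in `cudaStreamSynchronize`. The `invert` kernel and the `run_async` helper are hypothetical names of mine, and this is only one possible way to separate the two numbers.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Return false from run_async on any CUDA error (sketch only: resources are
// not released on the error path).
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            return false;                                             \
        }                                                             \
    } while (0)

// Hypothetical stand-in kernel: inverts 8-bit pixels.
__global__ void invert(unsigned char* img, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) img[i] = 255 - img[i];
}

// Queues copy-in, kernel, copy-out on one stream, then reports
//   *gpu_ms  : time between the two events, as seen by the GPU
//   *wait_ms : time the host spent blocked in cudaStreamSynchronize
bool run_async(int n, float* gpu_ms, double* wait_ms) {
    unsigned char* h_img = nullptr;
    unsigned char* d_img = nullptr;
    cudaStream_t stream;
    cudaEvent_t start, stop;
    CUDA_CHECK(cudaMallocHost((void**)&h_img, n));  // pinned, needed for true async copies
    CUDA_CHECK(cudaMalloc((void**)&d_img, n));
    CUDA_CHECK(cudaStreamCreate(&stream));
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));
    for (int i = 0; i < n; ++i) h_img[i] = (unsigned char)(i & 0xFF);

    CUDA_CHECK(cudaEventRecord(start, stream));
    CUDA_CHECK(cudaMemcpyAsync(d_img, h_img, n, cudaMemcpyHostToDevice, stream));
    invert<<<(n + 255) / 256, 256, 0, stream>>>(d_img, n);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaMemcpyAsync(h_img, d_img, n, cudaMemcpyDeviceToHost, stream));
    CUDA_CHECK(cudaEventRecord(stop, stream));

    // All calls above only enqueue work; the host blocks here until it is done.
    auto t0 = std::chrono::steady_clock::now();
    CUDA_CHECK(cudaStreamSynchronize(stream));
    auto t1 = std::chrono::steady_clock::now();

    *wait_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    CUDA_CHECK(cudaEventElapsedTime(gpu_ms, start, stop));

    bool ok = (h_img[0] == 255);  // pixel 0 started as 0, so it must now be 255
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    cudaFree(d_img);
    cudaFreeHost(h_img);
    return ok;
}

int main() {
    float gpu_ms = 0.0f;
    double wait_ms = 0.0;
    if (!run_async(1 << 20, &gpu_ms, &wait_ms)) return 1;
    std::printf("GPU work: %.3f ms, host wait in sync: %.3f ms\n", gpu_ms, wait_ms);
    return 0;
}
```

If the host wait is close to the GPU time, the synchronize call is simply absorbing the copy and kernel time that used to show up in the blocking copy. If the wait is much larger than the GPU time on some iterations, the extra delay is coming from somewhere other than the queued work itself.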