CUDA frequently causes execution latencies of up to 30 ms. Is there a hidden CUDA thread running?

I am having a similar problem, so I'll add my observations in case they help.

Page-locked memory + synchronous memory copy waits for other kernel operations to complete. So if you are using multiple threads, they might be blocking each other. Robert said something about it:

When I switched to pinned memory + asynchronous memory copy, the host-to-device transfer delays went away, but the overhead moved to the stream or device synchronization call. I'm now trying to find out why synchronization takes so long.
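To make the effect measurable, here is a minimal sketch of the kind of timing I mean. It is not my real code: `busy_kernel`, `time_async_copy`, the buffer size, and the iteration count are all made up for illustration. It queues a pinned-memory `cudaMemcpyAsync` and a kernel on one stream, then times the enqueue and the `cudaStreamSynchronize` call separately with a host clock. One plausible reading of what I see is that the asynchronous calls return almost immediately and the synchronization call simply absorbs the time of all the queued work, but I have not confirmed that this explains the whole delay.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that only exists to keep the GPU busy for a while.
__global__ void busy_kernel(float* data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k) v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

// Host wall-clock milliseconds elapsed since t0.
static double ms_since(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - t0).count();
}

// Times the enqueue phase and the synchronization phase separately.
// Returns true if every CUDA call succeeded.
bool time_async_copy(double* enqueue_ms, double* sync_ms) {
    const int n = 1 << 20;                 // assumed buffer size
    const size_t bytes = n * sizeof(float);
    float* h = nullptr;
    float* d = nullptr;
    cudaStream_t s;

    if (cudaMallocHost(&h, bytes) != cudaSuccess) return false;  // pinned host memory
    if (cudaMalloc(&d, bytes) != cudaSuccess) { cudaFreeHost(h); return false; }
    if (cudaStreamCreate(&s) != cudaSuccess) { cudaFree(d); cudaFreeHost(h); return false; }
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    // Phase 1: enqueue the copy and the kernel without waiting for them.
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, s);
    busy_kernel<<<(n + 255) / 256, 256, 0, s>>>(d, n, 1000);
    if (err == cudaSuccess) err = cudaGetLastError();  // catches launch errors
    *enqueue_ms = ms_since(t0);

    // Phase 2: block the host until everything queued on the stream is done.
    t0 = std::chrono::steady_clock::now();
    cudaError_t sync_err = cudaStreamSynchronize(s);
    *sync_ms = ms_since(t0);
    if (err == cudaSuccess) err = sync_err;

    cudaStreamDestroy(s);
    cudaFree(d);
    cudaFreeHost(h);
    return err == cudaSuccess;
}

int main() {
    double enqueue_ms = 0.0, sync_ms = 0.0;
    if (!time_async_copy(&enqueue_ms, &sync_ms)) {
        std::printf("CUDA call failed\n");
        return 1;
    }
    std::printf("enqueue: %.3f ms, synchronize: %.3f ms\n", enqueue_ms, sync_ms);
    return 0;
}
```

If the synchronize time here roughly matches the copy plus kernel time, the wait is just the queued work. If it is much longer in the real application, something else (for example work queued by other threads) is probably holding the stream up. Recording `cudaEvent_t` timestamps around the copy and the kernel would give the device-side durations to compare against.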