I am using CUDA 9.1 on Windows 10 with a Pascal GPU. In this case the GPU is not shared with graphics, though I would be interested in that case as well.
I was running some experiments to see if I could use CUDA to offload some processing with real-time requirements. I only have a window of ~2ms available, and that needs to include everything:
- One memory copy to the GPU
- One memory copy back to the CPU
I am not working with much data (kilobytes, not megabytes), so memory copy bandwidth is not an issue. I am also using pinned (page-locked) host memory, so the copies do not have to go through a pageable staging buffer.
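For reference, the buffers are set up roughly like this (a sketch; the buffer size and names are illustrative, not my actual values):

```cuda
#include <cuda_runtime.h>

int main() {
    float *h_buf = nullptr;   // pinned host buffer
    float *d_buf = nullptr;   // device buffer

    // cudaHostAlloc returns page-locked memory, so cudaMemcpy can DMA
    // directly from it instead of staging through pageable memory.
    cudaHostAlloc(&h_buf, 256 * sizeof(float), cudaHostAllocDefault);
    cudaMalloc(&d_buf, 256 * sizeof(float));

    cudaMemcpy(d_buf, h_buf, 256 * sizeof(float), cudaMemcpyHostToDevice);

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}
```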
The kernel execution by itself is short (hundreds of microseconds according to cudaEvent timestamping). However, no matter what I do, the memory copy operations alone sometimes cause me to miss the 2 ms window.
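The kernel time was measured with cudaEvent timestamps, roughly like this (a sketch; `myKernel` is a placeholder for the real kernel):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// placeholder kernel; the real one does the actual processing
__global__ void myKernel(float *d) { if (threadIdx.x == 0) d[0] += 1.0f; }

int main() {
    float *d_buf = nullptr;
    cudaMalloc(&d_buf, sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    myKernel<<<1, 32>>>(d_buf);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // GPU-side elapsed time in ms
    printf("kernel: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    return 0;
}
```

Note that cudaEvent timing measures GPU-side execution only; it does not capture launch or copy latency on the host side, which is why I timed the copies separately below.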
I simplified an experiment down to just this sequence:
- cudaMemcpy one float to the GPU
- cudaMemcpy one float back to the CPU
On average that round trip takes very little time (tens of microseconds), but the worst-case time is over 5 ms. That does not fit my 2 ms budget.
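The reduced test looks roughly like this (a sketch; the iteration count is illustrative, and host-side timing uses `std::chrono` rather than my exact harness):

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
    float *h_val = nullptr, *d_val = nullptr;
    cudaHostAlloc(&h_val, sizeof(float), cudaHostAllocDefault);  // pinned
    cudaMalloc(&d_val, sizeof(float));

    double worst_ms = 0.0, total_ms = 0.0;
    const int iters = 100000;   // illustrative iteration count

    for (int i = 0; i < iters; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        cudaMemcpy(d_val, h_val, sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(h_val, d_val, sizeof(float), cudaMemcpyDeviceToHost);
        auto t1 = std::chrono::steady_clock::now();

        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        total_ms += ms;
        if (ms > worst_ms) worst_ms = ms;   // track the worst-case round trip
    }
    printf("avg %.3f ms, worst %.3f ms\n", total_ms / iters, worst_ms);

    cudaFreeHost(h_val);
    cudaFree(d_val);
    return 0;
}
```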
Is there any way to reduce the timing variability of cudaMemcpy on Windows?
Are there alternative ways to move data between host and device that avoid the variability I see with cudaMemcpy?