cudaMemcpy Timing Variability on Windows

I am using CUDA 9.1 on Windows 10 with a Pascal GPU. In this case the GPU is not shared with graphics, though I would be interested in that case as well.

I was running some experiments to see if I could use CUDA to offload some processing with real-time requirements. I only have a window of ~2ms available, and that needs to include everything:

  1. One memory copy to the GPU
  2. Kernel execution
  3. One memory copy back to the CPU

I am not working with much data (kByte, not MByte), so the memory copy bandwidth is not an issue. I am also using pinned memory so that there won’t be any page faults.

The kernel execution by itself is short (hundreds of microseconds according to cudaEvent timestamping). However, I found that no matter what I do, the memory copy operations alone cause me to sometimes miss the 2ms window.

I simplified an experiment down to just this sequence:

  1. cudaMemcpy one float to the GPU
  2. cudaMemcpy one float back to the CPU

On average that takes very little time (tens of microseconds), but the worst-case time is over 5ms. That will not work in my budget of 2ms.
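A minimal sketch of that experiment (helper-free, timing on the host with std::chrono since the goal is host-visible worst-case latency; error checking omitted for brevity):

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main() {
    float *h_val = nullptr, *d_val = nullptr;
    cudaHostAlloc(&h_val, sizeof(float), cudaHostAllocDefault); // pinned host buffer
    cudaMalloc(&d_val, sizeof(float));

    double worst_us = 0.0;
    for (int i = 0; i < 100000; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        cudaMemcpy(d_val, h_val, sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(h_val, d_val, sizeof(float), cudaMemcpyDeviceToHost);
        auto t1 = std::chrono::steady_clock::now();
        double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
        if (us > worst_us) worst_us = us;  // track the worst-case round trip
    }
    printf("worst-case round trip: %.1f us\n", worst_us);

    cudaFree(d_val);
    cudaFreeHost(h_val);
    return 0;
}
```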

Is there any way to reduce the timing variability of cudaMemcpy on Windows?
Are there alternative ways to move data that avoid the variability that I see when using cudaMemcpy?

If your GPU is in WDDM mode, WDDM can introduce significant jitter.

If you have only a small amount of data and you are pinning it anyway, you can read that data directly from GPU code and forgo the memcopy operations altogether. This is referred to colloquially as “zero-copy” and you will find many examples of it and discussion of it with a bit of searching.
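A minimal zero-copy sketch: the buffers are allocated mapped and pinned, the kernel dereferences them directly, and no cudaMemcpy is issued at all (the example kernel and values are mine, for illustration):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// The kernel reads its input from host memory and writes its result
// back to host memory directly; the accesses traverse PCIe as they occur.
__global__ void scale(const float *in, float *out) {
    *out = *in * 2.0f;
}

int main() {
    float *h_in, *h_out;   // host-side pointers
    float *d_in, *d_out;   // device-side aliases of the same allocations

    // Mapped + pinned host allocations ("zero-copy")
    cudaHostAlloc(&h_in,  sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc(&h_out, sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_in,  h_in,  0);
    cudaHostGetDevicePointer(&d_out, h_out, 0);

    *h_in = 21.0f;
    scale<<<1, 1>>>(d_in, d_out);
    cudaDeviceSynchronize();        // kernel completion makes the result visible
    printf("result: %f\n", *h_out);

    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return 0;
}
```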

Having said that, if WDDM jitter is the culprit, it can introduce jitter into the kernel launch process as well.

You can try to reduce WDDM jitter with various undocumented techniques to “flush” the WDDM command queue, such as creating an event, recording it after the kernel launch, and then immediately calling cudaEventQuery on it. If you want a hard guarantee of a particular timing, however, I’m not sure you can achieve that with WDDM.
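A sketch of that flushing idea (this relies on undocumented driver behavior, so treat it as a heuristic; the kernel name is a placeholder):

```cuda
#include <cuda_runtime.h>

__global__ void mykernel() { /* placeholder work */ }

void launch_and_flush() {
    cudaEvent_t flush_evt;
    cudaEventCreate(&flush_evt);

    mykernel<<<1, 1>>>();
    cudaEventRecord(flush_evt);  // enqueue an event behind the kernel
    cudaEventQuery(flush_evt);   // the query tends to make WDDM submit its
                                 // batched command buffer to the GPU now,
                                 // rather than at some later point

    cudaEventDestroy(flush_evt); // destruction of a pending event is deferred
}
```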

On Windows, GeForce GPUs can only be operated in WDDM mode. Titan and Quadro GPUs can be placed into TCC mode, which removes any WDDM effects from the situation.

Thank you for responding!

I have only tested with GeForce GPUs, so that is good to know about WDDM driver limitations. I will look into that.

I read about zero-copy memory a little in the past, but I could not tell if it would help. At a high level, I had these two main questions about it:

  1. It sounded like the typical use case for zero-copy was for a CPU and GPU with truly shared memory (like on a Jetson TX2). In my case there is a PCIe bus between the GPU and CPU. Does that force driver code on the CPU to get involved, or will the transfer truly occur on the hardware alone without any CPU intervention?
  2. It seemed like Unified Memory had more examples and documentation than zero-copy memory. Is there a way for Unified Memory to accomplish the same thing as far as allowing the GPU to be the master of memory transfers?

With mapped and pinned memory (“zero-copy”), there is no interaction with CPU code. The transactions (reads or writes) are carried out entirely in hardware, across PCIe and through the host’s memory controller.

Unified Memory under Windows on CUDA 9.x will behave in a fashion similar to explicit cudaMemcpy operations. The CUDA runtime will transfer any managed data to the GPU (more or less as if you had issued a cudaMemcpy) at the point of kernel launch, prior to actually launching the kernel. You can read more about it in the Unified Memory section of the CUDA programming guide. I would not expect the interaction with WDDM to be any different than if you had issued a cudaMemcpy operation (i.e. there will still be commands issued into the WDDM command queue), but I couldn’t say in detail whether there would be any meaningful difference with UM. You could certainly try it.
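For comparison, the managed-memory version of the one-float round trip would look roughly like this (kernel and values are illustrative; on Windows with CUDA 9.x the runtime still performs the transfer at launch, so this mainly changes who issues the copy, not whether one happens):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *v) { *v *= 2.0f; }

int main() {
    float *val = nullptr;
    cudaMallocManaged(&val, sizeof(float)); // one pointer valid on CPU and GPU

    *val = 21.0f;             // CPU write
    scale<<<1, 1>>>(val);     // runtime migrates the data at launch time
    cudaDeviceSynchronize();  // required before the CPU touches managed
                              // data again on Windows/CUDA 9.x
    printf("result: %f\n", *val);

    cudaFree(val);
    return 0;
}
```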