Is simultaneous D2D mem copy possible?

Quadro P4000


  • Two CPU threads
  • Two CUDA streams (one context).
  • Thread 1, Stream 1, Copy from Device memory A to B.
  • Thread 2, Stream 2, Copy from Device memory C to D.
  • Both threads use cudaMemcpyAsync.
  • I expected the dual-copy engines would allow the mem copies to run in parallel.

    From NSIGHT, it does not seem that this is the case.
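For readers who want to reproduce the setup, the test described above can be sketched roughly as follows. This is a minimal sketch, not the original poster's code: the helper names `copyOnStream` and `runTwoStreamD2D` and the buffer size are assumptions, and cleanup on allocation failure is skipped for brevity.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <thread>

// Hypothetical helper: enqueue one async D2D copy on a stream, then wait for it.
static cudaError_t copyOnStream(void* dst, const void* src, size_t bytes,
                                cudaStream_t stream) {
    cudaError_t err = cudaMemcpyAsync(dst, src, bytes,
                                      cudaMemcpyDeviceToDevice, stream);
    if (err != cudaSuccess) return err;
    return cudaStreamSynchronize(stream);
}

// Two CPU threads, two streams, one (primary) context:
// thread 1 copies A -> B on stream 1, thread 2 copies C -> D on stream 2.
// Returns true if both copies completed without error.
bool runTwoStreamD2D(size_t bytes) {
    void *a = nullptr, *b = nullptr, *c = nullptr, *d = nullptr;
    cudaStream_t s1, s2;
    bool ok = cudaMalloc(&a, bytes) == cudaSuccess &&
              cudaMalloc(&b, bytes) == cudaSuccess &&
              cudaMalloc(&c, bytes) == cudaSuccess &&
              cudaMalloc(&d, bytes) == cudaSuccess &&
              cudaStreamCreate(&s1) == cudaSuccess &&
              cudaStreamCreate(&s2) == cudaSuccess;
    if (!ok) return false;  // sketch only: no cleanup on this path

    cudaError_t e1 = cudaSuccess, e2 = cudaSuccess;
    std::thread t1([&] { e1 = copyOnStream(b, a, bytes, s1); });
    std::thread t2([&] { e2 = copyOnStream(d, c, bytes, s2); });
    t1.join();
    t2.join();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a); cudaFree(b); cudaFree(c); cudaFree(d);
    return e1 == cudaSuccess && e2 == cudaSuccess;
}
```

Calling `runTwoStreamD2D(512u << 20)` under a profiler should show whether the two copies overlap on the timeline; the code itself only reports success or failure.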

    As an additional experiment, I modified Thread 2 to kick off a busy-wait kernel. Thread 2 now queues the kernel, then the cudaMemcpyAsync, and finally does a stream synchronize.

    Here I saw that a mem copy from Thread 1 would execute in parallel with the kernel, but the next Thread 1 copy would then be blocked until the Thread 2 mem copy finished.

    Note: I do sometimes see a minor overlap between the two streams (copies).
    Note: Since these are async calls, using NSIGHT, I can see all the requests coming in (and queued), a small delay, and then the copies and kernel start running.


    1. Why are the D2D mem copies not running in parallel?
    2. What role do the dual-copy engines play here? Are they strictly used for off-GPU (host) copies?
    3. Is it possible to have true simultaneous D2D mem copies?

    Thank you…

    These are all interesting questions.

    However, there are limits to memory bandwidth. A D2D memcpy issued by the CUDA runtime (except for very small copies) should be able to come very close to saturating the memory bandwidth. If you did have 2 (or more) memcopy operations that were running “in parallel”, the best they could do is to (not perfectly) split the available bandwidth between them.
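One way to check the saturation claim on a given GPU is to time a large D2D copy with CUDA events and compare the result against the card's theoretical memory bandwidth. The following is a minimal sketch; the function name `measureD2DBandwidthGBs` and the choice to count both the read and the write traffic are assumptions made for illustration.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Times `repeats` D2D copies of `bytes` each on the default stream.
// Returns the achieved memory traffic in GB/s, or a negative value on error.
double measureD2DBandwidthGBs(size_t bytes, int repeats) {
    void *src = nullptr, *dst = nullptr;
    if (cudaMalloc(&src, bytes) != cudaSuccess) return -1.0;
    if (cudaMalloc(&dst, bytes) != cudaSuccess) { cudaFree(src); return -1.0; }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up copy so one-time setup costs are not included in the timing.
    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);

    cudaEventRecord(start, 0);
    for (int i = 0; i < repeats; ++i)
        cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDeviceToDevice, 0);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(src);
    cudaFree(dst);
    if (err != cudaSuccess || ms <= 0.0f) return -1.0;

    // Each copy reads `bytes` and writes `bytes`, hence the factor of 2.
    return 2.0 * (double)bytes * repeats / (ms * 1e-3) / 1e9;
}
```

If a single copy already reports a figure close to the device's peak memory bandwidth, two concurrent copies could at best share that same total.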

    It’s not obvious to me that overlapping D2D mem copies could provide any tangible benefit.

    This question is very similar to questions asking why large kernel launches are not witnessed to run “in parallel” even though “all conditions” necessary for concurrent execution have been satisfied.

    The machine does not have infinite capacity. If an operation fills the machine capacity for a particular operation type, a sensible conclusion is that other operations of the same type have no opportunity to run “in parallel”.

    GPUs are designed for maximized throughput, even where this comes at the expense of latency. They are also designed with as simple control mechanisms as is possible, in order to maximize resources that do actual (data-transforming) work. This also reduces the amount of energy required to accomplish a given task.

    Allowing operations to proceed in parallel when one operation is already capable of utilizing the machine fully runs counter to these design goals: (1) Overall throughput would not increase, and would likely see a minor decrease, as the efficiency of multiple concurrent operations would be less than 100%. (2) Additional control logic would be required.

    In contrast, CPUs are typically designed for low-latency operation, even at the expense of significant amounts of redundant energy-consuming work, and they contain massive amounts of control logic and data storage in the pursuit of this goal.

    Thank you both for the info. It’s definitely good to have this extra info.

    My question, however, uses a very simple scenario, I think: no real computation is happening. I am really only exercising the scheduler and the copy capability.

    The “Keep Busy” kernel is the good-old clock64 tick loop. And then we have the copies.
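A clock64 tick loop of the kind mentioned above, followed by a copy queued in the same stream, might look like the sketch below. The kernel name `busyWait`, the wrapper `runSpinThenCopy`, and the single-thread launch configuration are assumptions, not the poster's actual code.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Spins until roughly `cycles` GPU clock ticks have elapsed.
__global__ void busyWait(long long cycles) {
    const long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Queues the busy-wait kernel, then a D2D copy in the same stream,
// then synchronizes the stream. Returns true if every step succeeded.
bool runSpinThenCopy(size_t bytes, long long cycles) {
    void *src = nullptr, *dst = nullptr;
    cudaStream_t stream;
    if (cudaMalloc(&src, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dst, bytes) != cudaSuccess) { cudaFree(src); return false; }
    if (cudaStreamCreate(&stream) != cudaSuccess) {
        cudaFree(src); cudaFree(dst);
        return false;
    }

    busyWait<<<1, 1, 0, stream>>>(cycles);
    cudaError_t launchErr = cudaGetLastError();
    // Stream ordering: this copy starts only after the kernel has finished.
    cudaError_t copyErr = cudaMemcpyAsync(dst, src, bytes,
                                          cudaMemcpyDeviceToDevice, stream);
    cudaError_t syncErr = cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(src);
    cudaFree(dst);
    return launchErr == cudaSuccess && copyErr == cudaSuccess &&
           syncErr == cudaSuccess;
}
```

The spin duration in wall-clock time depends on the GPU clock rate, so `cycles` has to be tuned per device to make the kernel visible on a profiler timeline.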

    I have read about the dual-copy engines in the Quadro products, and, according to the documentation, it should allow for simultaneous copies -

    Note: I placed the P4000 into TCC mode (using a P2000 for display) and I did not see any better overlap or improved simultaneous copies.

    I modified my test and now Stream1 is performing a H2D copy. I decreased the amount of H2D memory being copied as the previous 64MB H2D took about the same amount of time as the 512MB D2D (no surprise). I decreased the H2D so that I can clearly see multiple H2D mem copies occurring during a single D2D mem copy.

    In this new scenario, I can clearly see simultaneous copies.

    So this brings me back to my previous questions: are the “Dual-Copy” engines strictly for off-GPU copies (H2D and/or D2H)? Is there a way to have true simultaneous D2D mem copies?

    Thank you both, again.


    Practically speaking: Yes. You may want to look up the generic technical term: DMA engine, where DMA = direct memory access. Since PCIe is a full-duplex interconnect, dual copy engines allow simultaneous data transport H2D and D2H, which is important for some use cases. It is easy to set up test cases that demonstrate such concurrent copies.
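A test case of the kind mentioned above can be sketched as follows: query how many asynchronous copy engines the device reports, then issue an H2D and a D2H copy on separate streams using pinned host memory. The function names `copyEngineCount` and `runBidirectionalCopies` are hypothetical, and the sketch omits cleanup on allocation failure.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Returns the number of async copy engines reported for `device`, or -1 on error.
int copyEngineCount(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    return prop.asyncEngineCount;
}

// Issues one H2D and one D2H copy on separate streams so that, on a device
// with two copy engines, they are eligible to run at the same time.
bool runBidirectionalCopies(size_t bytes) {
    void *hIn = nullptr, *hOut = nullptr, *dIn = nullptr, *dOut = nullptr;
    cudaStream_t up, down;
    // Pinned host memory is needed for truly asynchronous host transfers.
    bool ok = cudaMallocHost(&hIn, bytes) == cudaSuccess &&
              cudaMallocHost(&hOut, bytes) == cudaSuccess &&
              cudaMalloc(&dIn, bytes) == cudaSuccess &&
              cudaMalloc(&dOut, bytes) == cudaSuccess &&
              cudaStreamCreate(&up) == cudaSuccess &&
              cudaStreamCreate(&down) == cudaSuccess;
    if (!ok) return false;  // sketch only: no cleanup on this path

    cudaError_t e1 = cudaMemcpyAsync(dIn, hIn, bytes,
                                     cudaMemcpyHostToDevice, up);
    cudaError_t e2 = cudaMemcpyAsync(hOut, dOut, bytes,
                                     cudaMemcpyDeviceToHost, down);
    cudaError_t e3 = cudaDeviceSynchronize();

    cudaStreamDestroy(up);
    cudaStreamDestroy(down);
    cudaFree(dIn); cudaFree(dOut);
    cudaFreeHost(hIn); cudaFreeHost(hOut);
    return e1 == cudaSuccess && e2 == cudaSuccess && e3 == cudaSuccess;
}
```

Whether the two transfers actually overlap has to be confirmed in a profiler; the code only makes them eligible to overlap.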


    Copy engines are for arranging copies to/from external sources, AFAIK.

    AFAIK the exact mechanism(s) for D2D copies are unpublished.

    I think in practice it will be difficult or impossible to witness overlapping D2D copies for the reason I already stated. The trick that makes it “easy” to witness kernel concurrency (restricting kernel resource usage) has no analog in the D2D copy case, that I can think of.

    This question about copies strikes me as a potential XY problem. What are you trying to accomplish here at a higher level of abstraction? Data copies should always be minimized because they take up time and lots of energy without providing data-transformative work.

    GpuDirect - Data is pushed directly into GPU memory (BAR1 region). This data needs to now be moved to FB so a kernel can process it.

    It would be nice to be able to copy data to FB while the kernel is processing and copying results to an output buffer (inside the GPU) at the same time.

    Simultaneous Kernel (stream 1) / Copy (stream 2) doesn’t seem to be an issue. What does seem to be a potential problem, however, is that if I queue a Kernel + Copy in the same stream, this copy can block a copy from a different stream (see attached picture, original post).

    The kernel could directly process the data from BAR1, however, some sort of hand-shaking would be needed between GPU and external device to avoid data-overwrite.

    Thankfully, intra-GPU copies happen so fast that even with these delays it may not be a problem…

    Any reason you can’t use a classic double-buffer scheme, i.e. one buffer receives data from outside source while other buffer is being processed, then the two buffers swap roles?
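A classic double-buffer scheme can be sketched with two device buffers, each tied to its own stream, so that filling one buffer can overlap with processing the other. In this sketch the "outside source" is stood in for by an H2D copy from pinned host memory, and the processing step is a trivial `addOne` kernel; both are assumptions for illustration only, as are the function name and the chunking parameters.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Stand-in for the real processing step: adds 1 to every element.
__global__ void addOne(float* buf, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

// Processes `numChunks` chunks using two device buffers that alternate roles.
// Returns true if every output element equals the input element plus 1.
bool runDoubleBuffer(size_t chunkElems, int numChunks) {
    const size_t chunkBytes = chunkElems * sizeof(float);
    const size_t total = chunkElems * (size_t)numChunks;
    float *hIn = nullptr, *hOut = nullptr, *dBuf[2] = {nullptr, nullptr};
    cudaStream_t stream[2];
    // Sketch only: no cleanup on the early-return paths.
    if (cudaMallocHost(&hIn, total * sizeof(float)) != cudaSuccess) return false;
    if (cudaMallocHost(&hOut, total * sizeof(float)) != cudaSuccess) return false;
    for (int b = 0; b < 2; ++b) {
        if (cudaMalloc(&dBuf[b], chunkBytes) != cudaSuccess) return false;
        if (cudaStreamCreate(&stream[b]) != cudaSuccess) return false;
    }
    for (size_t i = 0; i < total; ++i) hIn[i] = (float)(i % 1024);

    const unsigned threads = 256;
    const unsigned blocks = (unsigned)((chunkElems + threads - 1) / threads);
    for (int c = 0; c < numChunks; ++c) {
        const int b = c % 2;  // buffer and stream used for this chunk
        float* in = hIn + (size_t)c * chunkElems;
        float* out = hOut + (size_t)c * chunkElems;
        // Work on one buffer is ordered within its stream; the other
        // buffer's stream is free to run concurrently.
        cudaMemcpyAsync(dBuf[b], in, chunkBytes, cudaMemcpyHostToDevice, stream[b]);
        addOne<<<blocks, threads, 0, stream[b]>>>(dBuf[b], chunkElems);
        cudaMemcpyAsync(out, dBuf[b], chunkBytes, cudaMemcpyDeviceToHost, stream[b]);
    }
    bool ok = cudaDeviceSynchronize() == cudaSuccess;
    for (size_t i = 0; ok && i < total; ++i) ok = (hOut[i] == hIn[i] + 1.0f);

    for (int b = 0; b < 2; ++b) { cudaStreamDestroy(stream[b]); cudaFree(dBuf[b]); }
    cudaFreeHost(hIn);
    cudaFreeHost(hOut);
    return ok;
}
```

Tying each buffer to one stream keeps the fill, process, and drain steps for that buffer in order without explicit events, at the cost of allowing only two chunks in flight.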

    Right, so the double-buffer would live in FB. The BAR1 region is too small as I am looking to process over 512MB of data / sec; though I am looking at changing the kernel to allow processing of smaller chunks as the data arrives instead of waiting for the full 512MB.

    How big is the BAR1 region for the Quadro P4000? Processing in relatively small chunks, say 16MB each, would intuitively seem advantageous in a variety of ways, but I don’t know the details of your use case (e.g. what kind of processing, how much output data results and where does it go).

    256MB but I have only been successful at allocating 220MB.


    That’s a decently-sized BAR1 region nonetheless. I would suggest looking into processing data at finer granularity than you are currently envisioning. In terms of PCIe efficiency, you should see the full bandwidth at transfer sizes of 16MB and higher. But then your bandwidth requirements are less than 1/20 of the PCIe bandwidth available (~12.5 GB/sec on a gen3 x16 link), so that shouldn’t be much of a concern.

    And so the beauty of GpuDirect. NVIDIA did a great thing here, allowing me to bypass the CPU.

    NVLINK is another great feature that I hope to take advantage of! All they need to do is provide the API to allow P2P between WDDM and TCC GPUs - one for compute, another for display processing. Currently one needs to go via the host (not an issue if you’re on a non-Windows OS).

    WDDM is controlled by Microsoft. Observing its historical progression, each new version seems to make it harder for NVIDIA to retain some modicum of control over the GPU. Based on that, it seems unlikely that major new functionality will emerge for GPUs used with WDDM drivers.

    VMD is an excellent example of the kind of impressive visualization that is possible in conjunction with CUDA-accelerated applications with currently existing technology.