CUDA: combining H2D and D2H memory transfer operations

While experimenting with CUDA, I’ve reached streams, and now I’m trying to get the most out of them.
So far I have been able to pack my 3-stream processing tightly by overlapping kernel execution with memory operations.
I’m using a GeForce GTX 650, which I believe has 2 copy engines and should therefore be able to perform H2D and D2H operations simultaneously.

What I do: I have created 3 streams and allocated 3 memory blocks in pinned memory.
Then I run a loop. In every iteration I ask each stream to perform its next operation asynchronously:

  • H2D operation for (i)-th stream
  • kernel procedure for (i-1)-th stream
  • D2H operation for (i-2)-th stream.

This means that every stream performs H2D, kernel, and D2H operations in turn.
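The loop described above could be sketched roughly as follows. This is not the poster's code (which was not shared), only a minimal illustration under assumptions: the kernel `work`, the chunk count `kChunks`, and the buffer names are hypothetical, and the same pinned buffer serves as both upload source and download destination to keep the example short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: adds 1.0f to every element.
__global__ void work(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int kStreams = 3;
    const int kChunks = 12;                  // assumed number of work items
    const int n = 128 * 1024;                // 128K floats = 512 KB per chunk
    const size_t bytes = n * sizeof(float);

    cudaStream_t stream[kStreams];
    float* host[kStreams];
    float* dev[kStreams];
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMallocHost((void**)&host[s], bytes);   // pinned host memory
        cudaMalloc((void**)&dev[s], bytes);
        for (int j = 0; j < n; ++j) host[s][j] = 0.0f;
    }

    // Iteration i issues: H2D for chunk i, kernel for chunk i-1, D2H for chunk i-2.
    // Chunk c always uses stream c % 3, so its three steps stay ordered in-stream.
    for (int i = 0; i < kChunks + 2; ++i) {
        if (i < kChunks) {
            int s = i % kStreams;
            cudaMemcpyAsync(dev[s], host[s], bytes, cudaMemcpyHostToDevice, stream[s]);
        }
        if (i >= 1 && i - 1 < kChunks) {
            int s = (i - 1) % kStreams;
            work<<<(n + 255) / 256, 256, 0, stream[s]>>>(dev[s], n);
        }
        if (i >= 2 && i - 2 < kChunks) {
            int s = (i - 2) % kStreams;
            cudaMemcpyAsync(host[s], dev[s], bytes, cudaMemcpyDeviceToHost, stream[s]);
        }
    }
    cudaDeviceSynchronize();                 // wait for all streams to drain
    printf("host[0][0] after the pipeline: %f\n", host[0][0]);

    for (int s = 0; s < kStreams; ++s) {
        cudaFree(dev[s]);
        cudaFreeHost(host[s]);
        cudaStreamDestroy(stream[s]);
    }
    return 0;
}
```

Whether the H2D and D2H copies issued in the same iteration actually overlap depends on the number of copy engines on the device; the issue order alone does not guarantee it.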

I monitored the application with Nsight Visual Studio Edition. To make everything more visible, I made the memory transfers long enough (512 KB) and had the kernel perform lots of local floating-point operations in every thread. This makes the timeline easier to read.

As far as I can see, my memory operations fill the whole timeline with little to no space between them, and the kernels execute in perfect parallel with the memory operations.
But the memory operations themselves are executed one by one: H2D operations never overlap with D2H operations.

Am I missing something here?

PS: I don’t want to post my code here because its details make it a little complicated. On the other hand, the code is pretty standard, much like the vector addition examples.


GeForce cards only have 1 asynchronous copy engine, so you will not be able to obtain concurrent H2D and D2H transfers on your device. The exception is a small memory transfer, which may be implemented using a mechanism other than the copy engine.

While I have not actually used this feature in any project, I thought that the GeForce GTX 980 has two:

Device 0: "GeForce GTX 980"
  CUDA Driver Version / Runtime Version          7.0 / 6.5
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 4096 MBytes (4294967296 bytes)
  (16) Multiprocessors, (128) CUDA Cores/MP:     2048 CUDA Cores
  GPU Clock rate:                                1367 MHz (1.37 GHz)
  Memory Clock rate:                             3505 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 2097152 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           2 / 0

So are they not asynchronous?
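For reference, the copy-engine count in the listing above can also be read directly from the CUDA runtime, which takes the reporting application out of the picture. A minimal sketch, assuming only the standard `cudaGetDeviceProperties` call and its `asyncEngineCount` field:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("No CUDA device found\n");
        return 1;
    }
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // asyncEngineCount: 1 means copies can overlap with kernels,
        // 2 means copies in both directions can also overlap each other.
        printf("Device %d: \"%s\" reports %d copy engine(s)\n",
               d, prop.name, prop.asyncEngineCount);
    }
    return 0;
}
```

This only shows what the driver reports; whether transfers really overlap still has to be confirmed with a profiler or a timing test.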

Interesting observation! Based on my knowledge, I would have agreed with Greg: There is only one DMA engine on consumer GPUs. Now I wonder whether the two copy engines reported for your GTX 980 are a “premium feature” found on high-end consumer GPUs, a new feature on Maxwell-based consumer cards, a bug in the driver, or a bug in the app that reports the capabilities. When the CUDA 7.0 release goes final, I will check the documentation as to what it says about dual DMA engines.

I think the small transfers that Greg is referring to are those small host->device transfers that are injected directly into the GPU’s command queue and that are therefore independent of any potentially concurrent device->host transfers by a copy engine. That was intended more as a latency optimization, though (instead of sending a command to the GPU that then turns on the DMA engine to fetch the data, just send the data itself), rather than as an attempt to improve the concurrency of transfers. I seem to recall a 64 KB size limit for such copies. A microbenchmark could probably pinpoint the exact limit, but I am too lazy right now to write one.

Is there any existing CUDA SDK sample or other test code I could run to test?

Have you checked the CUDA samples? While I am not aware of one, they may include a test for concurrent transfers. It is not difficult to write one: basically a simple bandwidth test with increasing block sizes and two CUDA streams that can copy simultaneously if the hardware/driver allows it. The difference in execution time compared to a single-stream configuration executing the same transfers should clearly show whether uploads and downloads happen simultaneously or not.
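A test along those lines might look like the sketch below. It is only an illustration under assumptions: the helper `timeCopies`, the size range (64 KB to 32 MB), and the repetition count are made up for this example, and timing uses a host wall clock around `cudaDeviceSynchronize` rather than anything more precise.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Wall-clock milliseconds for `reps` rounds of one upload plus one download.
// Passing the same stream twice forces the two copies to run back to back.
static double timeCopies(char* dUp, const char* hUp, char* hDown, const char* dDown,
                         size_t bytes, cudaStream_t up, cudaStream_t down, int reps) {
    cudaDeviceSynchronize();
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) {
        cudaMemcpyAsync(dUp, hUp, bytes, cudaMemcpyHostToDevice, up);
        cudaMemcpyAsync(hDown, dDown, bytes, cudaMemcpyDeviceToHost, down);
    }
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const int reps = 10;
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    for (size_t bytes = 64u << 10; bytes <= (32u << 20); bytes *= 2) {
        char *hUp, *hDown, *dUp, *dDown;
        cudaMallocHost((void**)&hUp, bytes);     // pinned, required for async copies
        cudaMallocHost((void**)&hDown, bytes);
        cudaMalloc((void**)&dUp, bytes);
        cudaMalloc((void**)&dDown, bytes);

        double one = timeCopies(dUp, hUp, hDown, dDown, bytes, s0, s0, reps);
        double two = timeCopies(dUp, hUp, hDown, dDown, bytes, s0, s1, reps);
        printf("%8zu KB  one stream %8.3f ms  two streams %8.3f ms  ratio %.2f\n",
               bytes >> 10, one, two, one / two);

        cudaFree(dUp);
        cudaFree(dDown);
        cudaFreeHost(hUp);
        cudaFreeHost(hDown);
    }
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    return 0;
}
```

For large blocks, a ratio close to 2 would suggest that uploads and downloads overlap, while a ratio close to 1 would suggest they are serialized. Small blocks are dominated by launch overhead, so their ratios are harder to interpret.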

Ok, so I tried this quick test on the GTX 980 to see if there are two copy engines and this was the nvprof output which was interesting:

Using single GPU GeForce GTX 980
CPUtime= 7
==7096== Profiling application: ConsoleApplication1.exe
==7096== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
224.00ms  3.8400us                    -               -         -         -         -  65.536KB  17.067GB/s  GeForce GTX 980         1         7  [CUDA memset]
224.15ms  3.1232ms                    -               -         -         -         -  33.554MB  10.744GB/s  GeForce GTX 980         1        14  [CUDA memcpy DtoH]
224.15ms  3.1309ms                    -               -         -         -         -  33.554MB  10.717GB/s  GeForce GTX 980         1        13  [CUDA memcpy HtoD]
227.32ms  395.52us          (32768 1 1)       (256 1 1)         8        0B        0B         -           -  GeForce GTX 980         1        13  test_kernel_1(int*, int, int) [197]
227.72ms  397.57us          (32768 1 1)       (256 1 1)         8        0B        0B         -           -  GeForce GTX 980         1        14  test_kernel_1(int*, int, int) [202]
228.14ms  3.1446ms                    -               -         -         -         -  33.554MB  10.671GB/s  GeForce GTX 980         1        14  [CUDA memcpy HtoD]
228.14ms  3.1197ms                    -               -         -         -         -  33.554MB  10.756GB/s  GeForce GTX 980         1        13  [CUDA memcpy DtoH]

Which seems to indicate that it can perform a Device to Host copy concurrently with a Host to Device copy if they are in different streams.

Nvvp also shows graphically that those two copies of the same size in different directions occur during the same interval of time, which supports the statement that there are two copy engines in the GTX 980.

When I run the same code on the same PC and specify the GTX 780 Ti GPU, the output is different:

==6972== Profiling application: ConsoleApplication1.exe
==6972== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
227.84ms  4.3840us                    -               -         -         -         -  65.536KB  14.949GB/s  GeForce GTX 780         1         7  [CUDA memset]
227.89ms  2.7317ms                    -               -         -         -         -  33.554MB  12.284GB/s  GeForce GTX 780         1        13  [CUDA memcpy HtoD]
230.62ms  2.5920ms                    -               -         -         -         -  33.554MB  12.945GB/s  GeForce GTX 780         1        14  [CUDA memcpy DtoH]
233.25ms  271.87us          (32768 1 1)       (256 1 1)         8        0B        0B         -           -  GeForce GTX 780         1        13  test_kernel_1(int*, int, int) [198]
233.52ms  271.36us          (32768 1 1)       (256 1 1)         8        0B        0B         -           -  GeForce GTX 780         1        14  test_kernel_1(int*, int, int) [203]
233.81ms  2.5772ms                    -               -         -         -         -  33.554MB  13.020GB/s  GeForce GTX 780         1        13  [CUDA memcpy DtoH]
236.39ms  2.7450ms                    -               -         -         -         -  33.554MB  12.224GB/s  GeForce GTX 780         1        14  [CUDA memcpy HtoD]

So it does look like there are two copy engines in the GTX 980, unless I am interpreting this incorrectly.

I agree, the logs definitely suggest the GTX 980 is overlapping the copies in opposite directions while the GTX 780 Ti is not. That is consistent with the number of copy engines reported for the two cards.

Cool! On the other hand, it is weird. I have not come across any official reference to the dual copy engines on the GTX 980; you’d think this would be on a “new and improved” marketing slide somewhere.