How to Overlap Data Transfers in CUDA C/C++

Originally published at: https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/

In our last CUDA C/C++ post we discussed how to transfer data efficiently between the host and device. In this post, we discuss how to overlap data transfers with computation on the host, computation on the device, and in some cases other data transfers between the host and device. Achieving overlap between data transfers and other…

Thanks for the great article.
I suspect your cudaMemcpyAsync() invocations in the first example are missing the "kind" argument.
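For reference, a minimal sketch with the kind argument spelled out (the buffer names, sizes, and kernel here are just illustrative, not the post's actual code):

```
#include <cuda_runtime.h>

__global__ void kernel(float *a)
{
    a[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f;
}

int main()
{
    const int n = 1 << 20;
    const size_t nBytes = n * sizeof(float);
    float *h_a, *d_a;
    cudaMallocHost((void**)&h_a, nBytes); // pinned host memory, needed for truly async copies
    cudaMalloc((void**)&d_a, nBytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // The fourth argument is the "kind" mentioned above.
    cudaMemcpyAsync(d_a, h_a, nBytes, cudaMemcpyHostToDevice, stream);
    kernel<<<n / 256, 256, 0, stream>>>(d_a);
    cudaMemcpyAsync(h_a, d_a, nBytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_a);
    cudaFreeHost(h_a);
    return 0;
}
```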

Thanks for noticing! I fixed this.

Nice article. Small suggestion: the behavior of the default stream with respect to synchronization has changed across CUDA versions since the article was written (e.g., no more implicit synchronization as of CUDA 7). It would be useful to add a short recap of the behavior for each version.

I added mentions of the CUDA 7 behavior along with a link to the post GPU Pro Tip: CUDA 7 Streams Simplify Concurrency.
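For reference, a minimal sketch of the opt-in per-thread default stream behavior introduced in CUDA 7 (the kernel, sizes, and thread count are illustrative; see the Pro Tip post for the full discussion). Without the opt-in, the legacy default stream still synchronizes with other blocking streams.

```
// Build with: nvcc --default-stream per-thread example.cu
#include <cuda_runtime.h>
#include <thread>
#include <vector>

__global__ void kernel(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = sqrtf(a[i]);
}

void worker(float *d_a, int n)
{
    // With per-thread default streams, this launch goes into this host
    // thread's own default stream and can overlap with other threads' work.
    kernel<<<(n + 255) / 256, 256>>>(d_a, n);
    cudaStreamSynchronize(cudaStreamPerThread); // sync only this thread's default stream
}

int main()
{
    const int n = 1 << 20;
    const int nThreads = 4;
    std::vector<float*> bufs(nThreads);
    for (auto &p : bufs) cudaMalloc((void**)&p, n * sizeof(float));

    std::vector<std::thread> threads;
    for (auto p : bufs) threads.emplace_back(worker, p, n);
    for (auto &t : threads) t.join();

    for (auto p : bufs) cudaFree(p);
    return 0;
}
```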

Hi. I also want to overlap/hide the memory copy from pageable host memory to pinned host memory (following the model in your last post), but cudaMemcpyAsync with cudaMemcpyHostToHost does not do it. It also destroys the overlapping of cudaMemcpyAsync with cudaMemcpyHostToDevice. Do you have any idea why?

Hi Mark,

I have a question: if the times required for the host-to-device transfer, kernel execution, and device-to-host transfer are not the same, contrary to the post above, is there any formula to compute the optimal number of streams to create, for example on a Tesla K40 GPU?

To be more precise: if the time to transfer HtoD (input data) is much higher than DtoH (result data), and the kernel execution time is even higher than both memory transfers, is there any formula to compute the optimal number of streams to create to achieve maximum performance?

If there is any documentation or paper on this, it would be of great help if you could cite it here.
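To make the question concrete, this is roughly the kind of sweep I could run to pick the stream count empirically (just a sketch; the kernel, sizes, and candidate counts are placeholders):

```
#include <cuda_runtime.h>
#include <cstdio>

__global__ void kernel(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = a[i] * 2.0f + 1.0f; // placeholder work
}

// Time one full HtoD + kernel + DtoH pipeline split across nStreams streams.
float timePipeline(float *h, float *d, int n, int nStreams)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaStream_t *streams = new cudaStream_t[nStreams];
    for (int i = 0; i < nStreams; ++i) cudaStreamCreate(&streams[i]);

    int chunk = n / nStreams; // assumes n divides evenly
    cudaEventRecord(start, 0);
    for (int i = 0; i < nStreams; ++i) {
        int offset = i * chunk;
        cudaMemcpyAsync(d + offset, h + offset, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[i]);
        kernel<<<chunk / 256, 256, 0, streams[i]>>>(d + offset, chunk);
        cudaMemcpyAsync(h + offset, d + offset, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[i]);
    }
    cudaEventRecord(stop, 0); // completes only after all blocking streams finish
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    for (int i = 0; i < nStreams; ++i) cudaStreamDestroy(streams[i]);
    delete[] streams;
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1 << 22;
    float *h, *d;
    cudaMallocHost((void**)&h, n * sizeof(float)); // pinned host memory
    cudaMalloc((void**)&d, n * sizeof(float));

    int counts[] = {1, 2, 4, 8, 16}; // candidate stream counts
    for (int c : counts)
        printf("%2d streams: %.3f ms\n", c, timePipeline(h, d, n, c));

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```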

Thanks in advance

Hi, thank you for this great article. I have an observation with a Quadro K420. When using multiple streams, each on its own CPU thread, and synchronizing (after each copy + kernel + copy) within each thread, the streams get serialized in the timeline. When I enqueue many copy + kernel + copy sequences per stream and synchronize only once at the end, they all overlap. Why would cuStreamSynchronize(streamHandle) stop other streams from overlapping with this one? I tried changing synchronization policies such as spin wait, block, and yield; they all do the same. How can I copy + kernel + copy + synchronize on different threads with their own streams and expect them to overlap in time? It works on the first sync only and can't overlap anything on subsequent syncs.

Maybe this is only possible with Hyper-Q?

Note: all the CPU threads I mentioned are completely independent of each other. They don't wait for any specific order; they just issue commands to their own streams as soon as possible (maybe not good practice) and then expect the driver to handle the overlapping.

- Tested with both WDDM and TCC mode (I have 2 of the same card)
- Using the driver API equivalents (with the Async suffix).
- If arrays are not pinned, they do overlap, but about 30% slower overall

- Kernel is just vecAdd and data is 1M unsigned char elements per stream (for the a, b, c arrays)
- Arrays are shared, but each stream works on its own 1M-element region
- 3 streams
- Tried with and without #define CUDA_API_PER_THREAD_DEFAULT_STREAM
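For reference, the per-thread pattern I mean is roughly this (a simplified sketch using the runtime API instead of the driver API calls I actually use; sizes as in the notes above):

```
#include <cuda_runtime.h>
#include <thread>
#include <vector>

__global__ void vecAdd(const unsigned char *a, const unsigned char *b,
                       unsigned char *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Each CPU thread owns one stream and does copy + kernel + copy + synchronize on it.
void worker(const unsigned char *h_a, const unsigned char *h_b, unsigned char *h_c,
            unsigned char *d_a, unsigned char *d_b, unsigned char *d_c, int n)
{
    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaMemcpyAsync(d_a, h_a, n, cudaMemcpyHostToDevice, s);
    cudaMemcpyAsync(d_b, h_b, n, cudaMemcpyHostToDevice, s);
    vecAdd<<<(n + 255) / 256, 256, 0, s>>>(d_a, d_b, d_c, n);
    cudaMemcpyAsync(h_c, d_c, n, cudaMemcpyDeviceToHost, s);
    cudaStreamSynchronize(s); // the per-stream sync that seemed to serialize everything
    cudaStreamDestroy(s);
}

int main()
{
    const int nStreams = 3, n = 1 << 20; // 1M unsigned char elements per stream
    unsigned char *h_a, *h_b, *h_c, *d_a, *d_b, *d_c;
    cudaMallocHost((void**)&h_a, nStreams * n); // pinned, as in the notes
    cudaMallocHost((void**)&h_b, nStreams * n);
    cudaMallocHost((void**)&h_c, nStreams * n);
    cudaMalloc((void**)&d_a, nStreams * n);
    cudaMalloc((void**)&d_b, nStreams * n);
    cudaMalloc((void**)&d_c, nStreams * n);

    std::vector<std::thread> threads;
    for (int i = 0; i < nStreams; ++i) // each thread works on its own 1M region
        threads.emplace_back(worker, h_a + i * n, h_b + i * n, h_c + i * n,
                             d_a + i * n, d_b + i * n, d_c + i * n, n);
    for (auto &t : threads) t.join();

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    cudaFreeHost(h_a); cudaFreeHost(h_b); cudaFreeHost(h_c);
    return 0;
}
```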

It's really hard to help without more detailed information, and it's hard to debug programs in the comments of a blog post. May I suggest you post your question, along with a test program, either under the cuda tag on Stack Overflow or on the devtalk.nvidia.com forums? The experts on those sites are likely to be able to help find the issue. Thanks!

Thank you very much. I'll prepare a reproducible version and post it.

The issue was the CUDA max connections environment variable. Setting it to 16 and using TCC mode solved the problem.
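For anyone else who hits this: I believe the variable is CUDA_DEVICE_MAX_CONNECTIONS, and it has to be set before the first CUDA call creates the context. A minimal sketch (setting it in the shell before launching the program works too):

```
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
#ifdef _WIN32
    _putenv("CUDA_DEVICE_MAX_CONNECTIONS=16");      // Windows CRT
#else
    setenv("CUDA_DEVICE_MAX_CONNECTIONS", "16", 1); // POSIX
#endif
    cudaFree(0); // force context creation after the variable is set
    // ... create streams and launch work as usual ...
    return 0;
}
```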

Thanks for sharing your solution!

Forgot to say this was on Windows.

On Linux, everything is OK with or without the max connections setting.

Maybe Windows is not so focused on computing.

Hi,
Nice article. Definitely a good read.

I'd like to clarify something that may not have been considered by some. This article and its method assume that all the data would fit on the GPU if run in a single stream (stream0), right? In other words, this method would not work if I were trying to overlap processing and data transfer for a workload that does not fit in GPU main memory all at once.
Is there a way to trigger the next memory preload as soon as the current chunk of main-memory data has been moved to local scratchpad memory? I am imagining this optimization for something like PiRNA, which takes an enormous amount of memory to process.
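To make the question concrete, here is the kind of chunked pipeline I have in mind, where the full dataset stays in pinned host memory and only a few chunks are resident on the device at a time (a sketch; the kernel and sizes are placeholders):

```
#include <cuda_runtime.h>
#include <algorithm>

__global__ void process(float *chunk, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) chunk[i] *= 0.5f; // placeholder work
}

int main()
{
    const size_t total   = 1 << 26; // elements in the full (host-resident) dataset
    const size_t chunkSz = 1 << 22; // elements per device-resident chunk
    const int    nBufs   = 3;       // chunks resident on the GPU at any one time

    float *h_data;
    cudaMallocHost((void**)&h_data, total * sizeof(float)); // pinned for async copies

    float *d_buf[nBufs];
    cudaStream_t stream[nBufs];
    for (int i = 0; i < nBufs; ++i) {
        cudaMalloc((void**)&d_buf[i], chunkSz * sizeof(float));
        cudaStreamCreate(&stream[i]);
    }

    for (size_t offset = 0, c = 0; offset < total; offset += chunkSz, ++c) {
        int b = (int)(c % nBufs); // reuse device buffers/streams round-robin
        int n = (int)std::min(chunkSz, total - offset);
        // In-stream ordering guarantees the previous use of d_buf[b] (queued
        // earlier in stream[b]) finishes before this copy overwrites it.
        cudaMemcpyAsync(d_buf[b], h_data + offset, n * sizeof(float),
                        cudaMemcpyHostToDevice, stream[b]);
        process<<<(n + 255) / 256, 256, 0, stream[b]>>>(d_buf[b], n);
        cudaMemcpyAsync(h_data + offset, d_buf[b], n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[b]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < nBufs; ++i) { cudaFree(d_buf[i]); cudaStreamDestroy(stream[i]); }
    cudaFreeHost(h_data);
    return 0;
}
```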
Thanks

Thanks a lot for the article, Mark. I notice that the kernel executed in the sequential version uses 4x more threads than each of the kernels executed in the asynchronous versions. However, each of the kernels in the asynchronous version spent only 1/4 of the time compared to the kernel in the sequential version. I was expecting them to be almost the same. Could you please explain why? Thank you very much.