How to Overlap Data Transfers in CUDA C/C++

Originally published at:

In our last CUDA C/C++ post we discussed how to transfer data efficiently between the host and device. In this post, we discuss how to overlap data transfers with computation on the host, computation on the device, and in some cases other data transfers between the host and device. Achieving overlap between data transfers and other…

Thanks for the great article.
I suspect your cudaMemcpyAsync() invocations in the first example are missing the "kind" argument.

Thanks for noticing! I fixed this.

Nice article. Small suggestion: the behavior of the default stream with respect to synchronization has changed throught different CUDA versions sync the article has been written (e.g. no more implicit sync with CUDA 7). It would be useful to add a small recap of the behavior according to the version.

I added mentions of the CUDA 7 behavior along with a link to the post GPU Pro Tip: CUDA 7 Streams Simplify Concurrency.

Hi. I want to also overlap/hide the memory copy from host pageable to host pinned (following the model in your last post) but cudaMemcpyAsync with hosttohost does not do. it also destroy the overlapping of memcpyasync from hosttodevice. do you have any idea why?

Hi Mark,

I have a question, if the time required for the host-to-device transfer, kernel execution, and device-to-host transfer are not the same, as contrary to the above post. Then is there any formula to compute "the optimal number of streams to be created", for example on a Tesla K40 GPU?

To be more precise, if the time to transfer from HtoD (input data) is much higher than DtoH (result data) and the time for kernel execution is even higher than both the memory transfer, then is there any formula to compute the optimal number of streams to be created to achieve maximum performance.

If there is any documentation or papers on this it would be of great help, if you can cite them here.

Thanks in advance

Hi, thank you for this great article. I have some observation with Quadro K420. When using multiple streams on their own CPU threads and synchronizing(after each copy + kernel + copy) within their own CPU threads, they get serialized in timeline. When I enqueue many copy + kernel + copy per stream and synchronize only once at the end, they all overlap. Why would cuStreamSynchronize(streamHandle) stop other streams overlapping with this? I tried changing synchronization policies such as spin wait, block and yield. They all do same. How can I copy+kernel+copy+synchronize on different threads with their own streams and expect them to overlap in time? It does this on only first sync but can't overlap anything on next syncs.

Maybe this is possible with only hyper-q?

Note: all CPU threads I mentioned are completely free of each other. They don't wait for any specific order. They just issue commands to their own streams as soon as possible(maybe not a good practice but) then expect drivers to handle the overlapping.

- Tested with both WDDM and TCC mode (I have 2 of same card)
- Using driver api equivalent commands (with async suffix).
- If arrays are not pinned, they do overlap but nearly %30 slower overall

- kernel is just vecAdd and data is 1M unsigned char elements per stream (for a,b,c arrays)
- arrays are same but regions are 1M leaping per stream
- 3 streams
- tried with and without #define CUDA_API_PER_THREAD_DEFAULT_STREAM

It's really hard to help without more detailed information, and it's hard to debug programs in the comments of a blog post. May I suggest you post your question, along with a test program, on either the cuda tag of StackOverflow or on forums? The experts on those sites are likely to be able to help find the issue. Thanks!

Thank you very much. I'll prepare a retriggerable version and post them.

Issue was environment variable for cuda max connections. Setting it to 16 and using TCC mode solved the problem.

Thanks for sharing your solution!

Forgot to say this was windows.

In linux, all are ok with or without max connections setting.

Maybe windows is not so focuesd on computing.

Nice article. Definitely a good read.

I had a clarification which may not have been considered by some. This article and method assumes that all the data would fit on the GPU to be run in a single stream (stream0) right? In other words, this method would not work if I were trying to overlap processing and data transfer for a workload which does not fit in the GPU main memory all at once.
Is there a way to signal the next memory preload as soon as the current main memory data is moved to local scratch pad memory? I am imagining this optimization for something like PiRNA which takes an enormous amount of memory to process.

Thanks a lot for the article Mark. I notice that the kernel executed in sequential version uses 4x more threads compared to the kernels executed in the asynchronous versions. However, the each of the kernels in the asynchronous version only spent 1/4 time compared to the kernel in sequential version. I was expecting the they should almost be the same. Could you please explain why? Thank you very much.