Conditions for CUDA streams to overlap

In the simpleStreams example of the CUDA SDK, there is one kernel that perfectly overlaps with the asynchronous CPU-GPU memory transfer; see figure 1.

However, in my case, each stream executes more than one kernel in sequence, interleaved with a cuFFT call, before the CPU-GPU memory transfer. The result is illustrated in figure 2: the streams do not overlap.
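For clarity, here is a minimal, self-contained sketch of the pattern I mean; kernelA, kernelB, and the sizes are placeholders for my actual code, the buffers are left uninitialized (the sketch only shows issue order), and the cuFFT call is indicated by a comment:

[code]
#include <cuda_runtime.h>

#define NUM_STREAMS 4
#define N (1 << 18)                  // elements per stream (placeholder size)

// Placeholder kernels standing in for my real processing steps.
__global__ void kernelA(float2 *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { d[i].x *= 2.0f; d[i].y *= 2.0f; }
}
__global__ void kernelB(float2 *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i].x += 1.0f;
}

int main() {
    cudaStream_t stream[NUM_STREAMS];
    float2 *d_buf[NUM_STREAMS], *h_buf[NUM_STREAMS];

    for (int i = 0; i < NUM_STREAMS; ++i) {
        cudaStreamCreate(&stream[i]);
        cudaMalloc(&d_buf[i], N * sizeof(float2));
        // Pinned host memory is required for truly asynchronous copies.
        cudaHostAlloc(&h_buf[i], N * sizeof(float2), cudaHostAllocDefault);
    }

    dim3 block(256), grid((N + block.x - 1) / block.x);
    for (int i = 0; i < NUM_STREAMS; ++i) {
        kernelA<<<grid, block, 0, stream[i]>>>(d_buf[i], N);
        // ... cuFFT transform on d_buf[i] goes here, in stream[i] ...
        kernelB<<<grid, block, 0, stream[i]>>>(d_buf[i], N);
        cudaMemcpyAsync(h_buf[i], d_buf[i], N * sizeof(float2),
                        cudaMemcpyDeviceToHost, stream[i]);
    }
    cudaDeviceSynchronize();
    return 0;
}
[/code]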

This looks strange to me because the computations and memory transfers within different streams are independent.

How can I know in advance whether streams will overlap, or set up a strategy to obtain such a result? Is the missing overlap somehow related to the fragmentation of the timeline in the second case (perhaps due to kernel launch overhead)? Thanks in advance.


Especially on older GPUs, CUDA streams do not behave in a strict data-dependency fashion, because all actions are queued in a single hardware queue, giving rise to false dependencies. The following whitepaper, written in the context of CUDA Fortran but equally understandable for CUDA C users, discusses strategies for optimal stream performance under these restrictions:

[url]http://www.pgroup.com/lit/articles/insider/v3n1a4.htm[/url]
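From what I recall, the gist of the strategies discussed there is to issue work breadth-first (all copies of one kind together, then all kernels) rather than depth-first per stream, so that on a single hardware queue one stream's operations do not create false dependencies for the next stream's. A schematic sketch, where the kernel and buffer names are placeholders:

[code]
#include <cuda_runtime.h>

__global__ void kernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];    // placeholder work
}

// Breadth-first issue order: batch all H2D copies, then all kernels, then
// all D2H copies, instead of copy+kernel+copy per stream (depth-first).
void issueBreadthFirst(int nStreams, cudaStream_t *stream,
                       float **h_in, float **d_in,
                       float **h_out, float **d_out,
                       size_t nbytes, int n, dim3 grid, dim3 block) {
    for (int i = 0; i < nStreams; ++i)
        cudaMemcpyAsync(d_in[i], h_in[i], nbytes,
                        cudaMemcpyHostToDevice, stream[i]);
    for (int i = 0; i < nStreams; ++i)
        kernel<<<grid, block, 0, stream[i]>>>(d_in[i], d_out[i], n);
    for (int i = 0; i < nStreams; ++i)
        cudaMemcpyAsync(h_out[i], d_out[i], nbytes,
                        cudaMemcpyDeviceToHost, stream[i]);
}
[/code]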

From what I understand, Kepler GPUs with HyperQ should allow CUDA streams to behave as expected under a pure data-dependency model, provided the number of CUDA streams does not exceed the number of available hardware queues. I have no first-hand experience with this.
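As a quick sanity check, you can also query whether a given device advertises the relevant capabilities at all; a minimal sketch using standard runtime API queries:

[code]
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);             // device 0; adjust as needed
    // 1 if the device can run multiple kernels concurrently
    printf("concurrentKernels: %d\n", prop.concurrentKernels);
    // number of copy engines: >= 1 allows copy/compute overlap,
    // 2 allows simultaneous H2D and D2H copies
    printf("asyncEngineCount : %d\n", prop.asyncEngineCount);
    return 0;
}
[/code]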

Does CUFFT have support for CUDA streams like CUBLAS? I am not up to date on that. If there is no mechanism for assigning CUFFT work to specific streams, it would be assigned to the null stream, which has synchronizing properties.

Thank you very much for your usual kind answer.

Concerning CUFFT, it has cufftSetStream() which associates a CUDA stream with a CUFFT plan. I’m using it in my tests.
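For completeness, this is the association I am doing in my tests (the transform size is a placeholder):

[code]
#include <cuda_runtime.h>
#include <cufft.h>

int main() {
    const int n = 1024;                      // placeholder transform size
    cufftComplex *d_data;
    cudaMalloc(&d_data, n * sizeof(cufftComplex));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);
    cufftSetStream(plan, stream);  // all cufftExec* calls on this plan now go into 'stream'
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);

    cudaStreamSynchronize(stream);
    cufftDestroy(plan);
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return 0;
}
[/code]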

Regarding Kepler, tomorrow I will try to remotely connect to a Kepler machine and see what happens on that architecture. I will also read the paper you suggested.

I will let you know.

I have made some tests on the Kepler machine, but the situation for the non-overlapping case remains the same: the streams do not overlap.

I have also read the paper you recommended, which I think follows the same guidelines as [url]https://developer.nvidia.com/content/how-overlap-data-transfers-cuda-cc[/url] and as the SDK example: overlapping memory transfers with computation.

From the timelines, it seems that, for the non-overlapping case I’m considering, the memory transfers take a negligible time compared to the kernel executions; nevertheless, I cannot explain why not even a minimal overlap occurs. I also notice that the timeline is quite fragmented, which is perhaps due to kernel launch overhead. Why is this “idle” time not exploited for any overlap?

I have another question; I hope you could kindly give me some hints :-)

In all the CUDA examples I have read, the attention is focused on overlapping memory transfers with kernel execution. But if I have to perform independent computations only, can I expect kernel executions to overlap (provided I properly set up the computational grid)?

Thank you very much in advance.

Concurrent kernels have been supported since the Fermi architecture, and there should be an SDK example called concurrentKernels. I do not have first-hand experience, but from my understanding this helps when there are many small kernels that are individually too small to fill the GPU. For large kernels that can fill the machine by themselves, the only performance benefit I would expect is a tiny bit of overlap as one kernel drains and the next ramps up. I do not know to what extent use of the profiler may interfere with concurrent kernel execution and would suggest consulting the documentation.
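A minimal sketch of the idea, using deliberately tiny grids so that several kernel instances can be resident at once (the sizes and the busy-work loop are arbitrary):

[code]
#include <cuda_runtime.h>

// Deliberately small kernel: a single block cannot fill the GPU, so several
// instances launched into different streams may execute concurrently
// (on devices reporting concurrentKernels == 1).
__global__ void smallKernel(float *d, int n) {
    int i = threadIdx.x;
    if (i < n)
        for (int k = 0; k < 100000; ++k)     // busy work to make overlap visible
            d[i] = d[i] * 1.0000001f + 0.5f;
}

int main() {
    const int nStreams = 8, n = 128;
    cudaStream_t stream[nStreams];
    float *d_buf[nStreams];
    for (int i = 0; i < nStreams; ++i) {
        cudaStreamCreate(&stream[i]);
        cudaMalloc(&d_buf[i], n * sizeof(float));
        cudaMemset(d_buf[i], 0, n * sizeof(float));
        smallKernel<<<1, n, 0, stream[i]>>>(d_buf[i], n);  // one block each
    }
    cudaDeviceSynchronize();
    return 0;
}
[/code]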

On a K20 (mind you, a Tesla-line Kepler, and probably not the GTX version of Kepler) you might see overlapping of kernels if you run them concurrently and if the individual kernels don’t fill up the GPU.

For example, I saw a performance gain when doing multiple concurrent BLAS operations on small matrices (since each operation didn’t fill the entire GPU); however, a zgemm on a 1024x1024 matrix was big enough to fill the GPU, and concurrent calculations didn’t show any performance gain.
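Roughly what the setup looked like; this is only a scheduling sketch (the matrix size is a placeholder and the buffers are left uninitialized):

[code]
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int nStreams = 4, n = 64;          // small matrices (placeholder size)
    cudaStream_t stream[nStreams];
    cuDoubleComplex *A[nStreams], *B[nStreams], *C[nStreams];

    cublasHandle_t handle;
    cublasCreate(&handle);
    const cuDoubleComplex alpha = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex beta  = make_cuDoubleComplex(0.0, 0.0);

    for (int i = 0; i < nStreams; ++i) {
        cudaStreamCreate(&stream[i]);
        cudaMalloc(&A[i], n * n * sizeof(cuDoubleComplex));
        cudaMalloc(&B[i], n * n * sizeof(cuDoubleComplex));
        cudaMalloc(&C[i], n * n * sizeof(cuDoubleComplex));
    }
    // One small ZGEMM per stream; since each call underutilizes the GPU,
    // they have a chance to run concurrently.
    for (int i = 0; i < nStreams; ++i) {
        cublasSetStream(handle, stream[i]);  // route the next call into stream[i]
        cublasZgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, A[i], n, B[i], n, &beta, C[i], n);
    }
    cudaDeviceSynchronize();
    cublasDestroy(handle);
    return 0;
}
[/code]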

BTW - for Kepler it’s not the HyperQ feature by itself that gives you the additional concurrency, but rather the fact that the Kepler architecture has multiple hardware queues, as opposed to a single hardware queue, as njuffa mentioned. The HyperQ feature just makes use of this hardware support :)
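If I remember correctly, the number of hardware connections a process actually uses on Kepler can be tuned with the CUDA_DEVICE_MAX_CONNECTIONS environment variable (default 8, maximum 32). It has to be set before the first CUDA call, e.g.:

[code]
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    // Must be set before the CUDA context is created, i.e. before the
    // first CUDA runtime call (POSIX setenv; or export it in the shell).
    setenv("CUDA_DEVICE_MAX_CONNECTIONS", "32", 1);
    cudaFree(0);   // force context creation with the new setting
    // ... create streams and launch work as usual ...
    return 0;
}
[/code]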

eyal