Conditions for CUDA streams to overlap

In the simpleStreams example of the CUDA SDK, there is one kernel that perfectly overlaps with the asynchronous CPU-GPU memory transfer; see figure 1.

However, in my case, each stream executes more than one kernel in sequence, interleaved with a cuFFT call, before the CPU-GPU memory transfer. The result is illustrated in figure 2: the streams do not overlap.
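For clarity, here is a minimal, self-contained sketch of the pattern I mean; kernelA, kernelB, and the sizes are placeholders for my actual code, the buffers are left uninitialized (the sketch only shows issue order), and the cuFFT call is indicated by a comment:

[code]
#include <cuda_runtime.h>

#define NUM_STREAMS 4
#define N (1 << 18)                  // elements per stream (placeholder size)

// Placeholder kernels standing in for my real processing steps.
__global__ void kernelA(float2 *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { d[i].x *= 2.0f; d[i].y *= 2.0f; }
}
__global__ void kernelB(float2 *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i].x += 1.0f;
}

int main() {
    cudaStream_t stream[NUM_STREAMS];
    float2 *d_buf[NUM_STREAMS], *h_buf[NUM_STREAMS];

    for (int i = 0; i < NUM_STREAMS; ++i) {
        cudaStreamCreate(&stream[i]);
        cudaMalloc(&d_buf[i], N * sizeof(float2));
        // Pinned host memory is required for truly asynchronous copies.
        cudaHostAlloc(&h_buf[i], N * sizeof(float2), cudaHostAllocDefault);
    }

    dim3 block(256), grid((N + block.x - 1) / block.x);
    for (int i = 0; i < NUM_STREAMS; ++i) {
        kernelA<<<grid, block, 0, stream[i]>>>(d_buf[i], N);
        // ... cuFFT transform on d_buf[i] goes here, in stream[i] ...
        kernelB<<<grid, block, 0, stream[i]>>>(d_buf[i], N);
        cudaMemcpyAsync(h_buf[i], d_buf[i], N * sizeof(float2),
                        cudaMemcpyDeviceToHost, stream[i]);
    }
    cudaDeviceSynchronize();
    return 0;
}
[/code]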

This looks strange to me because the computations and memory transfers within different streams are independent.

How can I know in advance whether streams will overlap, or set up a strategy to obtain such a result? Is the missing overlap somehow related to the fragmentation of the timeline in the second case (perhaps due to kernel launch overhead)? Thanks in advance.


Especially on older GPUs, CUDA streams do not behave in a strict data-dependency fashion, because all actions are queued in a single hardware queue, giving rise to false dependencies. The following whitepaper, written in the context of CUDA Fortran but equally understandable for CUDA C users, discusses strategies for optimal stream performance under these restrictions:

[url]http://www.pgroup.com/lit/articles/insider/v3n1a4.htm[/url]
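From what I recall, the gist of the strategies discussed there is to issue work breadth-first (all copies of one kind together, then all kernels) rather than depth-first per stream, so that on a single hardware queue one stream's operations do not create false dependencies for the next stream's. A schematic sketch, where the kernel and buffer names are placeholders:

[code]
#include <cuda_runtime.h>

__global__ void kernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];    // placeholder work
}

// Breadth-first issue order: batch all H2D copies, then all kernels, then
// all D2H copies, instead of copy+kernel+copy per stream (depth-first).
void issueBreadthFirst(int nStreams, cudaStream_t *stream,
                       float **h_in, float **d_in,
                       float **h_out, float **d_out,
                       size_t nbytes, int n, dim3 grid, dim3 block) {
    for (int i = 0; i < nStreams; ++i)
        cudaMemcpyAsync(d_in[i], h_in[i], nbytes,
                        cudaMemcpyHostToDevice, stream[i]);
    for (int i = 0; i < nStreams; ++i)
        kernel<<<grid, block, 0, stream[i]>>>(d_in[i], d_out[i], n);
    for (int i = 0; i < nStreams; ++i)
        cudaMemcpyAsync(h_out[i], d_out[i], nbytes,
                        cudaMemcpyDeviceToHost, stream[i]);
}
[/code]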

From what I understand, Kepler GPUs with HyperQ should allow CUDA streams to behave as expected under a pure data-dependency model, provided the number of CUDA streams does not exceed the number of available hardware queues. I have no first-hand experience with this.
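As a quick sanity check, you can also query whether a given device advertises the relevant capabilities at all; a minimal sketch using standard runtime API queries:

[code]
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);             // device 0; adjust as needed
    // 1 if the device can run multiple kernels concurrently
    printf("concurrentKernels: %d\n", prop.concurrentKernels);
    // number of copy engines: >= 1 allows copy/compute overlap,
    // 2 allows simultaneous H2D and D2H copies
    printf("asyncEngineCount : %d\n", prop.asyncEngineCount);
    return 0;
}
[/code]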

Does CUFFT have support for CUDA streams like CUBLAS? I am not up to date on that. If there is no mechanism for assigning CUFFT work to specific streams, it would be assigned to the null stream, which has synchronizing properties.

Thank you very much for your usual kind answer.

Concerning CUFFT, it has cufftSetStream() which associates a CUDA stream with a CUFFT plan. I’m using it in my tests.
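For completeness, this is the association I am doing in my tests (the transform size is a placeholder):

[code]
#include <cuda_runtime.h>
#include <cufft.h>

int main() {
    const int n = 1024;                      // placeholder transform size
    cufftComplex *d_data;
    cudaMalloc(&d_data, n * sizeof(cufftComplex));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);
    cufftSetStream(plan, stream);  // all cufftExec* calls on this plan now go into 'stream'
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);

    cudaStreamSynchronize(stream);
    cufftDestroy(plan);
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return 0;
}
[/code]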

Regarding Kepler, tomorrow I will try to remotely connect to a Kepler machine and see what happens on that architecture. I will also read the paper you suggested.

I will let you know.

I have made some tests on the Kepler machine, but the situation for the non-overlapping case remains the same: the streams do not overlap.

I have also read the paper you recommended, which I think follows the same guidelines as [url]https://developer.nvidia.com/content/how-overlap-data-transfers-cuda-cc[/url] and as the SDK example: overlapping memory transfers with computation.

From the timelines, it seems that, for the non-overlapping case I’m considering, the memory transfers take a negligible time compared to the kernel executions; nevertheless, I cannot explain why not even a minimal overlap occurs. I also notice that the timeline is quite fragmented, which is perhaps due to kernel launch overhead. Why is this “idle” time not exploited for any overlap?

I have another question; I hope you could kindly give me some hints :-)

In all the CUDA examples I have read, the attention is focused on overlapping memory transfers with kernel execution. But if I have to perform independent computations only, can I expect kernel executions to overlap (provided I properly set up the computational grid)?

Thank you very much in advance.

Concurrent kernels have been supported since the Fermi architecture, and there should be an SDK example called concurrentKernels. I do not have first-hand experience, but from my understanding this helps when there are many small kernels that are individually too small to fill the GPU. For large kernels that can fill the machine by themselves, the only performance benefit I would expect is a tiny bit of overlap as one kernel drains and the next ramps up. I do not know to what extent use of the profiler may interfere with concurrent kernel execution and would suggest consulting the documentation.
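A minimal sketch of the idea, using deliberately tiny grids so that several kernel instances can be resident at once (the sizes and the busy-work loop are arbitrary):

[code]
#include <cuda_runtime.h>

// Deliberately small kernel: a single block cannot fill the GPU, so several
// instances launched into different streams may execute concurrently
// (on devices reporting concurrentKernels == 1).
__global__ void smallKernel(float *d, int n) {
    int i = threadIdx.x;
    if (i < n)
        for (int k = 0; k < 100000; ++k)     // busy work to make overlap visible
            d[i] = d[i] * 1.0000001f + 0.5f;
}

int main() {
    const int nStreams = 8, n = 128;
    cudaStream_t stream[nStreams];
    float *d_buf[nStreams];
    for (int i = 0; i < nStreams; ++i) {
        cudaStreamCreate(&stream[i]);
        cudaMalloc(&d_buf[i], n * sizeof(float));
        cudaMemset(d_buf[i], 0, n * sizeof(float));
        smallKernel<<<1, n, 0, stream[i]>>>(d_buf[i], n);  // one block each
    }
    cudaDeviceSynchronize();
    return 0;
}
[/code]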

On a K20 (mind you, a Tesla-line Kepler, and probably not the GTX version of Kepler) you might see overlapping of kernels if you run them concurrently and if the individual kernels don’t fill up the GPU.

For example, I saw a performance gain when doing multiple concurrent BLAS operations on small matrices (since each operation didn’t fill the entire GPU); however, a zgemm on a 1024x1024 matrix was big enough to fill the GPU, and concurrent calculations didn’t show any performance gain.
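Roughly what the setup looked like; this is only a scheduling sketch (the matrix size is a placeholder and the buffers are left uninitialized):

[code]
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int nStreams = 4, n = 64;          // small matrices (placeholder size)
    cudaStream_t stream[nStreams];
    cuDoubleComplex *A[nStreams], *B[nStreams], *C[nStreams];

    cublasHandle_t handle;
    cublasCreate(&handle);
    const cuDoubleComplex alpha = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex beta  = make_cuDoubleComplex(0.0, 0.0);

    for (int i = 0; i < nStreams; ++i) {
        cudaStreamCreate(&stream[i]);
        cudaMalloc(&A[i], n * n * sizeof(cuDoubleComplex));
        cudaMalloc(&B[i], n * n * sizeof(cuDoubleComplex));
        cudaMalloc(&C[i], n * n * sizeof(cuDoubleComplex));
    }
    // One small ZGEMM per stream; since each call underutilizes the GPU,
    // they have a chance to run concurrently.
    for (int i = 0; i < nStreams; ++i) {
        cublasSetStream(handle, stream[i]);  // route the next call into stream[i]
        cublasZgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, A[i], n, B[i], n, &beta, C[i], n);
    }
    cudaDeviceSynchronize();
    cublasDestroy(handle);
    return 0;
}
[/code]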

BTW - for Kepler it’s not the HyperQ feature by itself that gives you the additional concurrency, but rather the fact that the Kepler architecture has multiple hardware queues, as opposed to a single hardware queue, as njuffa mentioned. The HyperQ feature just makes use of this hardware support :)
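If I remember correctly, the number of hardware connections a process actually uses on Kepler can be tuned with the CUDA_DEVICE_MAX_CONNECTIONS environment variable (default 8, maximum 32). It has to be set before the first CUDA call, e.g.:

[code]
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    // Must be set before the CUDA context is created, i.e. before the
    // first CUDA runtime call (POSIX setenv; or export it in the shell).
    setenv("CUDA_DEVICE_MAX_CONNECTIONS", "32", 1);
    cudaFree(0);   // force context creation with the new setting
    // ... create streams and launch work as usual ...
    return 0;
}
[/code]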

eyal