Benefits (or lack thereof) of using CUDA streams for kernel concurrency

Hi all, I have a question about CUDA streams, inspired by this question and in particular by these statements from its answer:

Witnessing actual kernel concurrency [by using CUDA streams] is hard. We need kernels with a relatively long execution time but a relatively low demand on GPU resources.

I understand that CUDA streams can provide speedups when, for example, performing asynchronous host-to-device or device-to-host data transfers. But do they still provide a speedup when used to run kernels concurrently, if each of those kernels fills up the GPU anyway?

For example, if I am trying to achieve kernel concurrency on kernels that are dealing with arrays 1M - 10M+ elements in length, would the use of CUDA streams be pointless because the GPU is already filled up with work?

A real-world example: in my application I am using the CUB library to sort 4 arrays, each possibly 1M+ elements in length. I was wondering whether it would be worthwhile to perform each sort on a separate CUDA stream. I was hoping for some input from this forum before changing my application, as there are a few areas where I could use CUDA streams, but most of them deal with very long arrays that I imagine fill up the GPU.


Probably pointless, yes (there may be some small benefit due to the tail effect; just google that).
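For anyone wondering how big the tail can be: a kernel's blocks are processed in roughly "waves" of however many blocks the GPU can hold at once, and the last wave is usually only partly full. The idle slots in that last wave are the only room a kernel from another stream has to squeeze into. Below is a rough sketch that estimates this for a given kernel. The `scale` kernel, the `WaveInfo` struct, and `split_into_waves` are made-up names for illustration, and the wave model assumes every block takes about the same time, which real kernels only approximate.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical stand-in kernel; substitute the kernel you care about.
__global__ void scale(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

struct WaveInfo {
    long long full_waves;   // waves in which every resident block slot is busy
    long long tail_blocks;  // blocks left over for the final, partial wave
};

// Split a grid into full waves plus a tail, given how many blocks the whole
// GPU can hold at once (resident_slots must be > 0).
WaveInfo split_into_waves(long long grid_blocks, long long resident_slots) {
    return {grid_blocks / resident_slots, grid_blocks % resident_slots};
}

int main() {
    const int block_size = 256;
    const long long n = 10000000;  // 10M elements, as in the question
    int device = 0, sm_count = 0, blocks_per_sm = 0;

    // Ask the runtime how many blocks of this kernel fit on one SM.
    if (cudaGetDevice(&device) != cudaSuccess ||
        cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount,
                               device) != cudaSuccess ||
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, scale, block_size, 0) != cudaSuccess ||
        blocks_per_sm == 0) {
        std::fprintf(stderr, "could not query device or occupancy\n");
        return 1;
    }

    const long long slots = static_cast<long long>(sm_count) * blocks_per_sm;
    const long long grid = (n + block_size - 1) / block_size;
    const WaveInfo w = split_into_waves(grid, slots);
    const long long idle = (w.tail_blocks > 0) ? slots - w.tail_blocks : 0;

    std::printf("grid blocks: %lld, resident slots: %lld\n", grid, slots);
    std::printf("full waves: %lld, tail blocks: %lld, idle slots in tail: %lld\n",
                w.full_waves, w.tail_blocks, idle);
    return 0;
}
```

With millions of elements the grid typically spans many full waves, so the partly empty last wave is a small fraction of the total run time, which fits the "small benefit" estimate.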

Why not try it?

I have indeed tried to do it but couldn’t get it to work. I browsed some material online but couldn’t find any straightforward “example code” on CUDA streams for concurrent kernels. There is plenty on asynchronous H2D/D2H copies, though; maybe my search engine has gotten a bit stuck on those hits. If you happen to know of any learning material for CUDA streams that you think is particularly nice, I would appreciate it.

You mean you couldn’t write code using cub::DeviceRadixSort, which takes a stream parameter, to launch 4 sorts on 4 streams?
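For reference, a minimal sketch of what that could look like, assuming 32-bit unsigned keys and an illustrative array size of 1M. The `CHECK` macro, the `SortJob` struct, and the `run_sorts` and `run_demo` helpers are made up for this example; only the `cub::DeviceRadixSort::SortKeys` call and the CUDA runtime functions are library API. Note that each sort gets its own temporary storage, since sorts that may run at the same time cannot share scratch space.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>

#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            std::fprintf(stderr, "CUDA error at line %d: %s\n",  \
                         __LINE__, cudaGetErrorString(err_));    \
            std::exit(EXIT_FAILURE);                             \
        }                                                        \
    } while (0)

constexpr int kNumArrays = 4;
constexpr int kNumItems = 1 << 20;  // illustrative size, not from the thread

struct SortJob {
    cudaStream_t stream = nullptr;
    unsigned int* d_in = nullptr;
    unsigned int* d_out = nullptr;
    void* d_temp = nullptr;  // each in-flight sort needs its own scratch space
};

// Enqueue all sorts, wait for them, and return the elapsed wall time in ms.
double run_sorts(SortJob* jobs, size_t temp_bytes, bool separate_streams) {
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < kNumArrays; ++i) {
        cudaStream_t s = separate_streams ? jobs[i].stream : jobs[0].stream;
        CHECK(cub::DeviceRadixSort::SortKeys(
            jobs[i].d_temp, temp_bytes, jobs[i].d_in, jobs[i].d_out, kNumItems,
            0, int(sizeof(unsigned int) * 8), s));
    }
    CHECK(cudaDeviceSynchronize());
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int run_demo() {
    std::vector<unsigned int> h(kNumItems);
    SortJob jobs[kNumArrays];
    const size_t bytes = kNumItems * sizeof(unsigned int);

    for (auto& job : jobs) {
        for (auto& k : h) k = static_cast<unsigned int>(std::rand());
        CHECK(cudaStreamCreate(&job.stream));
        CHECK(cudaMalloc(&job.d_in, bytes));
        CHECK(cudaMalloc(&job.d_out, bytes));
        CHECK(cudaMemcpy(job.d_in, h.data(), bytes, cudaMemcpyHostToDevice));
    }

    // A call with a null scratch pointer only reports the size required.
    size_t temp_bytes = 0;
    CHECK(cub::DeviceRadixSort::SortKeys(nullptr, temp_bytes, jobs[0].d_in,
                                         jobs[0].d_out, kNumItems));
    for (auto& job : jobs) CHECK(cudaMalloc(&job.d_temp, temp_bytes));

    run_sorts(jobs, temp_bytes, true);  // warm-up, result discarded
    const double one_stream_ms = run_sorts(jobs, temp_bytes, false);
    const double multi_stream_ms = run_sorts(jobs, temp_bytes, true);
    std::printf("one stream: %.3f ms, %d streams: %.3f ms\n", one_stream_ms,
                kNumArrays, multi_stream_ms);

    // Check every output on the host, then release resources.
    bool ok = true;
    for (auto& job : jobs) {
        CHECK(cudaMemcpy(h.data(), job.d_out, bytes, cudaMemcpyDeviceToHost));
        ok = ok && std::is_sorted(h.begin(), h.end());
        CHECK(cudaFree(job.d_in));
        CHECK(cudaFree(job.d_out));
        CHECK(cudaFree(job.d_temp));
        CHECK(cudaStreamDestroy(job.stream));
    }
    return ok ? 0 : 1;
}

int main() { return run_demo(); }
```

A single host-side timing like this is noisy, so treat the two numbers as a rough comparison only. Looking at the timeline in a profiler such as Nsight Systems shows directly whether kernels from different streams actually overlap.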

I’m partial to this series. Section 7 is the one that is relevant to streams.

I did, but I did not get any speedup (a slowdown, even), so I’m not very confident I’ve done it properly, even though my code’s output is correct…


Since you didn’t get any speedup, and I said I didn’t expect any speedup, I think we are on the same page - particularly in the absence of any code to inspect whatsoever.

The primary purpose of streams is to overlap copy and compute. The aperture for applications that can benefit from overlapping compute with compute (concurrent kernels, etc.) is comparatively tiny, in my experience (unless you are going after a 2-3% speedup, which I generally am not). I don’t think I’ve ever run into such a code, although I acknowledge there is a theoretical argument for their existence.
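To make the copy/compute case concrete, here is a minimal sketch of the usual pattern: split the data into chunks and give each chunk its own stream, so that one chunk's transfer can overlap another chunk's kernel. The `add_one` kernel, the `CHECK` macro, the `run_demo` helper, and the chunk count and size are all assumptions chosen for illustration. The host buffer is pinned with `cudaMallocHost`, since asynchronous copies generally need pinned host memory to overlap with kernels.

```cuda
#include <cuda_runtime.h>

#include <cstdio>
#include <cstdlib>

#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            std::fprintf(stderr, "CUDA error at line %d: %s\n",  \
                         __LINE__, cudaGetErrorString(err_));    \
            std::exit(EXIT_FAILURE);                             \
        }                                                        \
    } while (0)

// Hypothetical stand-in for the real per-chunk computation.
__global__ void add_one(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int run_demo() {
    constexpr int kChunks = 4;
    constexpr int kChunkItems = 1 << 20;
    constexpr size_t kChunkBytes = kChunkItems * sizeof(float);

    float* h = nullptr;  // pinned host buffer
    float* d = nullptr;  // device buffer
    CHECK(cudaMallocHost(&h, kChunks * kChunkBytes));
    CHECK(cudaMalloc(&d, kChunks * kChunkBytes));
    for (int i = 0; i < kChunks * kChunkItems; ++i) h[i] = float(i % 1000);

    cudaStream_t streams[kChunks];
    for (auto& s : streams) CHECK(cudaStreamCreate(&s));

    // Within a stream, copy-in, kernel, and copy-out run in order; across
    // streams, the hardware is free to overlap them.
    for (int c = 0; c < kChunks; ++c) {
        float* hp = h + size_t(c) * kChunkItems;
        float* dp = d + size_t(c) * kChunkItems;
        CHECK(cudaMemcpyAsync(dp, hp, kChunkBytes, cudaMemcpyHostToDevice,
                              streams[c]));
        add_one<<<(kChunkItems + 255) / 256, 256, 0, streams[c]>>>(dp,
                                                                   kChunkItems);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(hp, dp, kChunkBytes, cudaMemcpyDeviceToHost,
                              streams[c]));
    }
    CHECK(cudaDeviceSynchronize());

    // Every element should have been incremented exactly once.
    bool ok = true;
    for (int i = 0; i < kChunks * kChunkItems; ++i)
        ok = ok && (h[i] == float(i % 1000) + 1.0f);

    for (auto& s : streams) CHECK(cudaStreamDestroy(s));
    CHECK(cudaFree(d));
    CHECK(cudaFreeHost(h));
    return ok ? 0 : 1;
}

int main() { return run_demo(); }
```

Whether this pays off depends on how transfer time compares with kernel time; if one of the two dominates, there is little to hide behind the other.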