Hi all, I have a question on CUDA streams which was in turn inspired by this question, especially these statements from the answer:
Witnessing actual kernel concurrency [by using CUDA streams] is hard. We need kernels with a relatively long execution time but a relatively low demand on GPU resources.
I understand that CUDA streams provide speedups when, for example, overlapping kernel execution with asynchronous host-to-device or device-to-host transfers. But do they still provide speedups when they are used to achieve kernel concurrency between kernels that each fill up the GPU on their own?
For example, if I am trying to achieve kernel concurrency between kernels that operate on arrays of 1M-10M+ elements, is the use of CUDA streams pointless because each kernel already saturates the GPU with work?
A real-world example: in my application I use the CUB library to sort 4 arrays, each possibly 1M+ elements long, and I am wondering whether it would be worthwhile to perform each sort on a separate CUDA stream. I was hoping for some input from this forum before changing my application, since there are a few places where I could use CUDA streams, but most of them deal with very long arrays that I imagine already fill up the GPU.
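For context, here is roughly the change I have in mind, as a minimal sketch (array contents, sizes, and error handling are placeholders, not my actual code, and I haven't benchmarked this): each CUB sort gets its own stream and its own temp storage, with a single synchronize at the end.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Placeholder sizes for illustration only.
constexpr int NUM_ARRAYS = 4;
constexpr int N = 1 << 20;  // ~1M elements per array

int main() {
    cudaStream_t streams[NUM_ARRAYS];
    int *d_in[NUM_ARRAYS], *d_out[NUM_ARRAYS];
    void *d_temp[NUM_ARRAYS];
    size_t temp_bytes[NUM_ARRAYS];

    for (int i = 0; i < NUM_ARRAYS; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&d_in[i],  N * sizeof(int));
        cudaMalloc(&d_out[i], N * sizeof(int));

        // First call with a null temp-storage pointer only queries
        // how many temp bytes the sort needs; it does no sorting.
        temp_bytes[i] = 0;
        cub::DeviceRadixSort::SortKeys(nullptr, temp_bytes[i],
                                       d_in[i], d_out[i], N);
        cudaMalloc(&d_temp[i], temp_bytes[i]);
    }

    // Enqueue each sort on its own stream, so the runtime is free to
    // overlap them if GPU resources allow.
    for (int i = 0; i < NUM_ARRAYS; ++i) {
        cub::DeviceRadixSort::SortKeys(d_temp[i], temp_bytes[i],
                                       d_in[i], d_out[i], N,
                                       0, sizeof(int) * 8, streams[i]);
    }

    cudaDeviceSynchronize();  // wait for all four sorts to finish

    for (int i = 0; i < NUM_ARRAYS; ++i) {
        cudaFree(d_in[i]); cudaFree(d_out[i]); cudaFree(d_temp[i]);
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}
```

My question is essentially whether the second loop above buys anything over issuing all four sorts on the default stream, given that each sort on 1M+ elements presumably launches enough blocks to occupy every SM by itself.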