Optimal number of CUDA streams for overlapping computations and data transfers

Hello everyone,

I am referring to the following article:

https://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/

I am wondering whether there is (theoretically) ever a need for more than 3 CUDA streams to achieve optimal overlap of computation and H2D/D2H data transfers on devices of compute capability 2.0 or higher that have two DMA engines (which enable H2D transfers, D2H transfers, and kernel execution to overlap), even without assuming that the H2D transfer, the kernel execution, and the D2H transfer take approximately the same time.

Many thanks in advance.

I don’t think there is a hard-and-fast rule; it all depends on the details of your use case.

In my experience, two or three streams are often all that is needed to hide most of the host/device copies behind computation kernels. Sometimes more streams than that would be advantageous, e.g. if the kernels are very small and you also need [partial] overlap of the kernels themselves.
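To make the pattern under discussion concrete, here is a minimal sketch of a chunked H2D copy / kernel / D2H copy pipeline in which the number of streams is a parameter, so the effect of 1, 2, 3 or more streams can be measured on a given device. The kernel `scale`, the helper `run_chunked`, the `CHECK` macro and all sizes are placeholders invented for illustration; they are not taken from the linked article, and the right chunk size and stream count will depend on the actual workload.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Placeholder kernel (hypothetical): multiplies each element by a factor.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Processes nChunks chunks of chunkSize floats, assigning chunks to
// nStreams streams round-robin. Returns the number of wrong results.
int run_chunked(int nStreams, int nChunks, int chunkSize)
{
    if (nStreams < 1) nStreams = 1;
    const size_t n = (size_t)nChunks * chunkSize;
    const size_t chunkBytes = (size_t)chunkSize * sizeof(float);

    float *h = nullptr, *d = nullptr;
    // Pinned host memory: asynchronous copies need it to overlap with kernels.
    CHECK(cudaMallocHost((void **)&h, n * sizeof(float)));
    CHECK(cudaMalloc((void **)&d, n * sizeof(float)));
    for (size_t i = 0; i < n; ++i) h[i] = 1.0f;

    std::vector<cudaStream_t> streams(nStreams);
    for (auto &s : streams) CHECK(cudaStreamCreate(&s));

    // Per chunk: H2D copy, kernel, D2H copy, all issued into the same stream.
    // Work in different streams is free to overlap if the hardware allows it.
    for (int c = 0; c < nChunks; ++c) {
        cudaStream_t s = streams[c % nStreams];
        const size_t off = (size_t)c * chunkSize;
        CHECK(cudaMemcpyAsync(d + off, h + off, chunkBytes,
                              cudaMemcpyHostToDevice, s));
        scale<<<(chunkSize + 255) / 256, 256, 0, s>>>(d + off, chunkSize, 2.0f);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h + off, d + off, chunkBytes,
                              cudaMemcpyDeviceToHost, s));
    }
    CHECK(cudaDeviceSynchronize());

    int wrong = 0;
    for (size_t i = 0; i < n; ++i) wrong += (h[i] != 2.0f);

    for (auto &s : streams) CHECK(cudaStreamDestroy(s));
    CHECK(cudaFree(d));
    CHECK(cudaFreeHost(h));
    return wrong;
}
```

Calling, say, `run_chunked(k, 16, 1 << 20)` for several values of `k` under a profiler is one way to see how much overlap each stream count actually gives. Whether copies in both directions can overlap at all depends on the device; the `asyncEngineCount` field of `cudaDeviceProp` reports the number of copy engines.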

Thank you for your quick reply.

Do you know any use case where more than 3 streams are necessary?

I don’t think it is ever “necessary”, just “advantageous”. Maybe this example from the ParallelForAll blog is helpful:

https://devblogs.nvidia.com/parallelforall/gpu-pro-tip-cuda-7-streams-simplify-concurrency/

Thank you.

My question was possibly not well formulated. What I actually meant: “Do you know any use case where more than 3 streams are advantageous?”

Many thanks.

I am confused now as to what you are looking for. The example I pointed at in #4 shows an exemplary use case involving small kernels, where the kernels can all run concurrently using more than three streams. That is the kind of use case I alluded to in #2 as well.
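As a rough illustration of the "small kernels" case (a sketch in the spirit of that scenario, not the code from the linked blog post), the following launches one single-block kernel per stream. The names `small_kernel` and `launch_small_kernels` and the sizes are hypothetical. Whether the kernels really execute concurrently depends on the device and on available resources, so this is best checked in a profiler.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical small kernel: one block, far too little work to fill a GPU.
__global__ void small_kernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = sqrtf((float)i);
}

// Launches one small kernel per stream. Returns 0 if a spot check passes.
int launch_small_kernels(int nStreams)
{
    const int n = 64;
    if (nStreams < 1) nStreams = 1;

    std::vector<cudaStream_t> streams(nStreams);
    std::vector<float *> bufs(nStreams, nullptr);
    for (int i = 0; i < nStreams; ++i) {
        CHECK(cudaStreamCreate(&streams[i]));
        CHECK(cudaMalloc((void **)&bufs[i], n * sizeof(float)));
    }

    // Each kernel goes into its own stream, so the kernels are allowed
    // (but not guaranteed) to run at the same time.
    for (int i = 0; i < nStreams; ++i) {
        small_kernel<<<1, n, 0, streams[i]>>>(bufs[i], n);
        CHECK(cudaGetLastError());
    }
    CHECK(cudaDeviceSynchronize());

    // Spot check: the last element of the last buffer should be sqrt(63).
    float last = 0.0f;
    CHECK(cudaMemcpy(&last, bufs[nStreams - 1] + (n - 1), sizeof(float),
                     cudaMemcpyDeviceToHost));

    for (int i = 0; i < nStreams; ++i) {
        CHECK(cudaFree(bufs[i]));
        CHECK(cudaStreamDestroy(streams[i]));
    }
    return (last > 7.9f && last < 8.0f) ? 0 : 1;
}
```

With `launch_small_kernels(8)`, a timeline view would show whether the eight launches overlap or are serialized on the device at hand.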

I am not considering the case of “small kernels”, where the kernel computations themselves can also be overlapped.

So you are not interested in scenarios where there are small kernels, even scenarios where those small kernels also involve D2H and H2D copies as stated in your original question?

Let us assume for the sake of argument that we could establish conclusively that for a scenario of “large” kernels with D2H and H2D activity one can always achieve “optimal” overlap using <= 3 non-default streams (for some suitable definition of “large” and “optimal”). Where or how would that matter?

Note that I am not saying that it is even possible to conclusively establish the stated premise, because frankly, I don’t know that one can show that.