I am wondering whether there is (theoretically) a need for more than 3 CUDA streams to achieve optimal overlap of computation and H2D/D2H data transfers on devices of compute capability > 2.0 (i.e., where two DMA engines allow H2D transfers, D2H transfers, and kernel execution to overlap), even when the times required for the H2D transfer, the kernel execution, and the D2H transfer are not approximately the same.

I don’t think there is a hard-and-fast rule; it all depends on the details of your use case.

In my experience, two or three streams are often all that is needed to hide most of the host/device copies behind computation kernels. More streams than that can sometimes be advantageous, e.g., if the kernels are very small and you also need [partial] overlap of the kernels themselves.
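For readers who want to try this on their own hardware, here is a minimal sketch of the usual chunked pipeline, in which each chunk's H2D copy, kernel, and D2H copy are issued into one of a small pool of streams. The kernel `scale`, the helper `run_pipeline`, and the chunk count and size are all hypothetical placeholders, not taken from any particular application. It only makes overlap possible; how much overlap you actually get depends on the device and on the relative durations of the three steps.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real computation: scales each element in place.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Processes num_chunks chunks round-robin over num_streams streams.
// Returns the number of wrong elements (0 on success) or -1 on a CUDA error;
// wall-clock time for the pipelined section is written to *elapsed_ms.
int run_pipeline(int num_streams, double *elapsed_ms)
{
    const int num_chunks = 12;        // assumed chunk count
    const int chunk_elems = 1 << 20;  // assumed chunk size (4 MiB of floats)
    const size_t chunk_bytes = chunk_elems * sizeof(float);
    const size_t total = (size_t)num_chunks * chunk_elems;

    float *h_data = nullptr, *d_data = nullptr;
    // Pinned host memory lets cudaMemcpyAsync overlap with other work.
    if (cudaMallocHost((void **)&h_data, total * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&d_data, total * sizeof(float)) != cudaSuccess) {
        cudaFreeHost(h_data);
        return -1;
    }
    for (size_t i = 0; i < total; ++i) h_data[i] = 1.0f;

    cudaStream_t *streams = new cudaStream_t[num_streams];
    for (int s = 0; s < num_streams; ++s) cudaStreamCreate(&streams[s]);

    auto t0 = std::chrono::steady_clock::now();
    for (int c = 0; c < num_chunks; ++c) {
        cudaStream_t st = streams[c % num_streams];
        float *h = h_data + (size_t)c * chunk_elems;
        float *d = d_data + (size_t)c * chunk_elems;
        // Within one stream these three steps run in order;
        // steps issued into different streams may overlap.
        cudaMemcpyAsync(d, h, chunk_bytes, cudaMemcpyHostToDevice, st);
        scale<<<(chunk_elems + 255) / 256, 256, 0, st>>>(d, chunk_elems, 2.0f);
        cudaMemcpyAsync(h, d, chunk_bytes, cudaMemcpyDeviceToHost, st);
    }
    cudaError_t err = cudaDeviceSynchronize();  // wait for all streams
    auto t1 = std::chrono::steady_clock::now();
    if (elapsed_ms)
        *elapsed_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();

    int bad = 0;
    if (err != cudaSuccess) {
        bad = -1;
    } else {
        for (size_t i = 0; i < total; ++i)
            if (h_data[i] != 2.0f) ++bad;
    }

    for (int s = 0; s < num_streams; ++s) cudaStreamDestroy(streams[s]);
    delete[] streams;
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return bad;
}

int main()
{
    // Compare 1 to 4 streams; the timings are only meaningful on your own device.
    for (int s = 1; s <= 4; ++s) {
        double ms = 0.0;
        int bad = run_pipeline(s, &ms);
        printf("%d stream(s): %.3f ms, %d wrong elements\n", s, ms, bad);
    }
    return 0;
}
```

Running it with different stream counts, ideally under a profiler timeline, is probably the most direct way to see whether a third or fourth stream buys anything for a given mix of copy and kernel durations.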

I am confused now as to what you are looking for. The example I pointed to in #4 shows a use case involving small kernels that can all run concurrently when more than three streams are used. That is the kind of use case I alluded to in #2 as well.
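To make the small-kernel case concrete, here is a rough sketch (not the example referenced in #4, which is not reproduced here) that launches several single-block kernels, each into its own stream, so that they are allowed to run side by side. The kernel `spin`, the helper `run_small_kernels`, and the cycle count are hypothetical; whether the kernels really execute concurrently depends on the device and its limit on resident concurrent kernels.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical small kernel: one block that busy-waits for roughly `cycles`
// clock ticks, then writes a marker value so the host can check it ran.
__global__ void spin(long long cycles, int *out, int value)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
    if (threadIdx.x == 0) *out = value;
}

// Launches num_kernels small kernels round-robin over num_streams streams.
// Returns the number of kernels with a wrong marker (0 on success) or -1 on
// a CUDA error; wall-clock time for the launches is written to *elapsed_ms.
int run_small_kernels(int num_kernels, int num_streams, double *elapsed_ms)
{
    const long long cycles = 1000000;  // assumed per-kernel busy-wait length

    int *d_out = nullptr;
    if (cudaMalloc((void **)&d_out, num_kernels * sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(d_out, 0, num_kernels * sizeof(int));
    cudaDeviceSynchronize();  // make sure the memset is done before launching

    cudaStream_t *streams = new cudaStream_t[num_streams];
    for (int s = 0; s < num_streams; ++s) cudaStreamCreate(&streams[s]);

    auto t0 = std::chrono::steady_clock::now();
    for (int k = 0; k < num_kernels; ++k) {
        // One block of 32 threads leaves most of the GPU free for other kernels.
        spin<<<1, 32, 0, streams[k % num_streams]>>>(cycles, d_out + k, k + 1);
    }
    cudaError_t err = cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    if (elapsed_ms)
        *elapsed_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();

    int bad = 0;
    int *h_out = new int[num_kernels];
    if (err != cudaSuccess ||
        cudaMemcpy(h_out, d_out, num_kernels * sizeof(int),
                   cudaMemcpyDeviceToHost) != cudaSuccess) {
        bad = -1;
    } else {
        for (int k = 0; k < num_kernels; ++k)
            if (h_out[k] != k + 1) ++bad;
    }

    delete[] h_out;
    for (int s = 0; s < num_streams; ++s) cudaStreamDestroy(streams[s]);
    delete[] streams;
    cudaFree(d_out);
    return bad;
}

int main()
{
    const int num_kernels = 8;
    double serial_ms = 0.0, concurrent_ms = 0.0;
    int bad1 = run_small_kernels(num_kernels, 1, &serial_ms);
    int bad8 = run_small_kernels(num_kernels, num_kernels, &concurrent_ms);
    printf("1 stream:  %.3f ms, %d wrong\n", serial_ms, bad1);
    printf("%d streams: %.3f ms, %d wrong\n", num_kernels, concurrent_ms, bad8);
    return 0;
}
```

If the device can run the kernels concurrently, the multi-stream run should take noticeably less time than the single-stream run; if the two times come out about the same, the kernels were most likely serialized.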

So you are not interested in scenarios where there are small kernels, even scenarios where those small kernels also involve D2H and H2D copies as stated in your original question?

Let us assume for the sake of argument that we could establish conclusively that for a scenario of “large” kernels with D2H and H2D activity one can always achieve “optimal” overlap using <= 3 non-default streams (for some suitable definition of “large” and “optimal”). Where or how would that matter?

Note that I am not saying that it is even possible to conclusively establish the stated premise, because frankly, I don’t know that one can show that.