In the simpleStreams example of the CUDA SDK, a single kernel overlaps perfectly with the asynchronous CPU-GPU memory transfer (see figure 1).
However, in my case, each stream has to run several kernels in sequence, interleaved with a cuFFT call, before the CPU-GPU memory transfer. The result is shown in figure 2: the streams do not overlap.
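For concreteness, a minimal sketch of the per-stream pattern described above might look like the following. The kernel names `preprocess` and `postprocess`, the buffer sizes, the number of streams, and the device-to-host copy direction (taken from simpleStreams) are all assumptions made for illustration, not the actual code behind figure 2.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical placeholder kernels standing in for the real ones.
__global__ void preprocess(cufftComplex *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { d[i].x *= 2.0f; d[i].y *= 2.0f; }
}

__global__ void postprocess(cufftComplex *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { d[i].x *= 0.5f; d[i].y *= 0.5f; }
}

int main() {
    const int nStreams = 4;        // assumed stream count
    const int n = 1 << 16;         // assumed FFT length per stream
    const size_t bytes = n * sizeof(cufftComplex);

    cudaStream_t streams[nStreams];
    cufftHandle plans[nStreams];
    cufftComplex *d_buf[nStreams];
    cufftComplex *h_buf[nStreams];

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&d_buf[s], bytes);
        cudaMemset(d_buf[s], 0, bytes);
        // Pinned host memory is required for a truly asynchronous copy.
        cudaMallocHost(&h_buf[s], bytes);
        // One plan per stream, bound to that stream.
        cufftPlan1d(&plans[s], n, CUFFT_C2C, 1);
        cufftSetStream(plans[s], streams[s]);
    }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Per stream: kernel, cuFFT call, kernel, then the asynchronous copy.
    for (int s = 0; s < nStreams; ++s) {
        preprocess<<<blocks, threads, 0, streams[s]>>>(d_buf[s], n);
        cufftExecC2C(plans[s], d_buf[s], d_buf[s], CUFFT_FORWARD);
        postprocess<<<blocks, threads, 0, streams[s]>>>(d_buf[s], n);
        cudaMemcpyAsync(h_buf[s], d_buf[s], bytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    for (int s = 0; s < nStreams; ++s) {
        cufftDestroy(plans[s]);
        cudaFreeHost(h_buf[s]);
        cudaFree(d_buf[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```

Each stream here owns its own device buffer, pinned host buffer, and cuFFT plan, so no stream depends on data produced by another.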
This looks strange to me because the computations and memory transfers within different streams are independent.
How can I know in advance whether streams will overlap, or set up a strategy to obtain such a result? Is the missing overlap somehow related to the fragmentation of the timeline in the second case (perhaps due to kernel launch overhead)? Thanks in advance.
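As a partial check on the "in advance" part, the device properties report whether the hardware can overlap copies with kernels and run kernels concurrently at all. The sketch below assumes device 0 and only reports capability; it does not show that a particular sequence of launches will actually overlap.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0 assumed
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // Number of copy engines: 0 means copies cannot overlap with kernels.
    printf("asyncEngineCount:  %d\n", prop.asyncEngineCount);
    // Nonzero if the device can run kernels from different streams concurrently.
    printf("concurrentKernels: %d\n", prop.concurrentKernels);
    return 0;
}
```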