First Set of Commands in Set of Streams not Asynchronous?

I am currently using streams to asynchronously transfer data between device and host while working on smaller partitions of my problem.

For example:

If I have 5 streams and 20 partitions of my data set, each stream works on 4 partitions. Each stream is queued with repeated sets of input transfer, compute, and output transfer, as below.

Stream 1 Host -> Device transfer, Compute, Device -> Host transfer; Host -> Device transfer, Compute, Device -> Host transfer … …
Stream 2 Host -> Device transfer, Compute, Device -> Host transfer; Host -> Device transfer, Compute, Device -> Host transfer … …
Stream 3 Host -> Device transfer, Compute, Device -> Host transfer; Host -> Device transfer, Compute, Device -> Host transfer … …
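
A minimal sketch of the queueing pattern described above may help make the question concrete. It is not the original code: the kernel `compute`, the buffer sizes, the `CHECK` macro, and the round-robin issue order (partition p goes to stream p % 5) are all illustrative assumptions. How the operations are issued on the host can affect the overlap seen on some devices, so the issue order in particular should be read as a guess.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kStreams = 5;                  // from the post
constexpr int kPartitions = 20;              // from the post
constexpr int kElemsPerPartition = 1 << 18;  // assumed size, for illustration only

// Hypothetical placeholder kernel; the real compute step is not shown in the post.
__global__ void compute(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Hypothetical helper: print the CUDA error and bail out of main.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s at %s:%d\n", cudaGetErrorString(err_),  \
                    __FILE__, __LINE__);                                \
            return 1;                                                   \
        }                                                               \
    } while (0)

int main() {
    const size_t bytes = sizeof(float) * kElemsPerPartition;
    const size_t totalElems = static_cast<size_t>(kElemsPerPartition) * kPartitions;

    // Pinned host memory is needed for cudaMemcpyAsync to overlap with kernels.
    float* hostBuf = nullptr;
    CHECK(cudaMallocHost(&hostBuf, bytes * kPartitions));
    for (size_t i = 0; i < totalElems; ++i) hostBuf[i] = 1.0f;

    // One stream and one device buffer per stream; the buffer is reused for
    // each of the stream's 4 partitions, which is safe because operations
    // within a single stream execute in order.
    cudaStream_t streams[kStreams];
    float* devBuf[kStreams];
    for (int s = 0; s < kStreams; ++s) {
        CHECK(cudaStreamCreate(&streams[s]));
        CHECK(cudaMalloc(&devBuf[s], bytes));
    }

    const int threads = 256;
    const int blocks = (kElemsPerPartition + threads - 1) / threads;

    // Assumed issue order: partition p is queued on stream p % kStreams as
    // Host -> Device copy, compute, Device -> Host copy.
    for (int p = 0; p < kPartitions; ++p) {
        const int s = p % kStreams;
        float* h = hostBuf + static_cast<size_t>(p) * kElemsPerPartition;
        CHECK(cudaMemcpyAsync(devBuf[s], h, bytes, cudaMemcpyHostToDevice, streams[s]));
        compute<<<blocks, threads, 0, streams[s]>>>(devBuf[s], kElemsPerPartition);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h, devBuf[s], bytes, cudaMemcpyDeviceToHost, streams[s]));
    }

    // Wait for every stream to drain before reading results on the host.
    CHECK(cudaDeviceSynchronize());
    printf("first element after processing: %f\n", hostBuf[0]);

    for (int s = 0; s < kStreams; ++s) {
        CHECK(cudaFree(devBuf[s]));
        CHECK(cudaStreamDestroy(streams[s]));
    }
    CHECK(cudaFreeHost(hostBuf));
    return 0;
}
```

Running a sketch like this under a profiler timeline view is one way to check whether the first Device -> Host copy is delayed in the same way.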

This works well at hiding memory transfer latencies, except for the first Device -> Host transfer in the first set, which waits until the last compute in that set has finished (i.e. stream 1's data is not copied back to the host until stream 5's compute has finished).

At the moment this is just an annoyance that wastes maybe 2% of my execution time, but it is unexpected and it irritates me nonetheless.

Has anyone else experienced this? Any solutions?

Also, the attached picture might give you a better idea of what is going on.