Understanding Streams

I'm confused. :(

In the CUDA C Programming Guide, the rules for concurrent stream execution are given in one section (3.2.6.5.3), and then the next section (3.2.6.5.4) discusses specific cases that seem to violate the rules just given.

The basic case is:
a) Stream[0] memcpy host to device
b) Stream[0] execute a kernel
c) Stream[0] memcpy device to host
d) Stream[1] memcpy host to device
e) Stream[1] execute a kernel
f) Stream[1] memcpy device to host
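Issued in code, that sequence might look like the following sketch. The kernel, buffer names, and sizes here are all hypothetical placeholders, and error checking is omitted for brevity:

```cuda
#include <cuda_runtime.h>

// Trivial kernel standing in for the real work -- purely illustrative.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main(void) {
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);
    float *h_a, *h_b, *d_a, *d_b;

    // Async copies only overlap with other work when the host buffers are page-locked.
    cudaHostAlloc(&h_a, bytes, cudaHostAllocDefault);
    cudaHostAlloc(&h_b, bytes, cudaHostAllocDefault);
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Depth-first issue order: all of stream 0's work, then all of stream 1's.
    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s0);  // a
    scale<<<(N + 255) / 256, 256, 0, s0>>>(d_a, N);                // b
    cudaMemcpyAsync(h_a, d_a, bytes, cudaMemcpyDeviceToHost, s0);  // c
    cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, s1);  // d
    scale<<<(N + 255) / 256, 256, 0, s1>>>(d_b, N);                // e
    cudaMemcpyAsync(h_b, d_b, bytes, cudaMemcpyDeviceToHost, s1);  // f

    cudaDeviceSynchronize();

    cudaStreamDestroy(s0); cudaStreamDestroy(s1);
    cudaFreeHost(h_a); cudaFreeHost(h_b);
    cudaFree(d_a); cudaFree(d_b);
    return 0;
}
```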

The docs claim that b) and d) can't execute concurrently because d) is issued after c). Yet the previous section (3.2.6.5.3) lists only allocation of, and writes to, device memory (not host memory, other than page-locked allocations) as operations that cause implicit synchronization.

How can these two sections be reconciled?

The claim that "[b] and [d] can't execute simultaneously because [d] is executed after [c]" seems correct to me. Streams do not expose a full data-dependency model: operations sent to the GPU are simply inserted into a queue. Here [c] has a data dependency on [b], and [d] is queued behind [c], so [d] cannot execute concurrently with [b]. ([c] and [d] could run concurrently on a Fermi GPU with dual DMA engines.) My general recommendation is to issue most of the host-to-device copies at the start of the sequence to get the desired overlap. In this case I would order the operations as follows:

a) Stream[0] memcpy host to device
d) Stream[1] memcpy host to device
b) Stream[0] execute a kernel
e) Stream[1] execute a kernel
c) Stream[0] memcpy device to host
f) Stream[1] memcpy device to host

Now [d] can overlap with [b], and [e] with [c]. If these operations are performed in a loop (e.g. a simulation loop), [f] could also overlap with [a] on a Fermi with dual DMA engines.
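A sketch of that breadth-first issue order inside a loop (kernel, buffer names, sizes, and iteration count are all hypothetical placeholders; error checking omitted):

```cuda
#include <cuda_runtime.h>

// Trivial stand-in for the real kernel -- purely illustrative.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main(void) {
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);
    float *h_a, *h_b, *d_a, *d_b;

    // Page-locked host buffers, so the async copies can actually overlap.
    cudaHostAlloc(&h_a, bytes, cudaHostAllocDefault);
    cudaHostAlloc(&h_b, bytes, cudaHostAllocDefault);
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Breadth-first issue order: both H2D copies first, then both kernels,
    // then both D2H copies. Stream 1's copy [d] is queued before stream 0's
    // kernel [b] finishes issuing its successors, so [d] can overlap [b].
    for (int step = 0; step < 100; ++step) {                           // e.g. a simulation loop
        cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s0);  // a
        cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, s1);  // d
        scale<<<(N + 255) / 256, 256, 0, s0>>>(d_a, N);                // b
        scale<<<(N + 255) / 256, 256, 0, s1>>>(d_b, N);                // e
        cudaMemcpyAsync(h_a, d_a, bytes, cudaMemcpyDeviceToHost, s0);  // c
        cudaMemcpyAsync(h_b, d_b, bytes, cudaMemcpyDeviceToHost, s1);  // f
    }
    cudaDeviceSynchronize();

    cudaStreamDestroy(s0); cudaStreamDestroy(s1);
    cudaFreeHost(h_a); cudaFreeHost(h_b);
    cudaFree(d_a); cudaFree(d_b);
    return 0;
}
```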

OK, so despite having multiple "streams", there's essentially a single queue, and operations begin executing in the order they were queued? I had assumed each stream had its own queue.

Thanks so much for the explanation.