Understanding Streams

I'm confused. :(

In the CUDA C Programming Guide, the rules for concurrent stream execution are given in one section (3.2.6.5.3), and then the next section (3.2.6.5.4) discusses specific cases that seem to violate the rules just given.

The basic case is:
a) Stream[0] memcpy host to device
b) Stream[0] execute a kernel
c) Stream[0] memcpy device to host
d) Stream[1] memcpy host to device
e) Stream[1] execute a kernel
f) Stream[1] memcpy device to host
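Issued in code, that sequence might look like the following sketch. The kernel, buffer names, and sizes here are all hypothetical placeholders, and error checking is omitted for brevity:

```cuda
#include <cuda_runtime.h>

// Trivial kernel standing in for the real work -- purely illustrative.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main(void) {
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);
    float *h_a, *h_b, *d_a, *d_b;

    // Async copies only overlap with other work when the host buffers are page-locked.
    cudaHostAlloc(&h_a, bytes, cudaHostAllocDefault);
    cudaHostAlloc(&h_b, bytes, cudaHostAllocDefault);
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Depth-first issue order: all of stream 0's work, then all of stream 1's.
    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s0);  // a
    scale<<<(N + 255) / 256, 256, 0, s0>>>(d_a, N);                // b
    cudaMemcpyAsync(h_a, d_a, bytes, cudaMemcpyDeviceToHost, s0);  // c
    cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, s1);  // d
    scale<<<(N + 255) / 256, 256, 0, s1>>>(d_b, N);                // e
    cudaMemcpyAsync(h_b, d_b, bytes, cudaMemcpyDeviceToHost, s1);  // f

    cudaDeviceSynchronize();

    cudaStreamDestroy(s0); cudaStreamDestroy(s1);
    cudaFreeHost(h_a); cudaFreeHost(h_b);
    cudaFree(d_a); cudaFree(d_b);
    return 0;
}
```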

The docs claim that b) and d) can't execute concurrently because d) is issued after c). Yet the previous section (3.2.6.5.3) lists only allocation of, and writes to, device memory (not host memory, other than page-locked allocations) as operations that cause implicit synchronization.

How can these two sections be reconciled?

The claim that "[b] and [d] can't execute simultaneously because [d] is executed after [c]" seems correct to me. Streams do not expose a full data-dependency model: operations sent to the GPU are simply inserted into a queue. Here [c] has a data dependency on [b], and [d] is queued behind [c], so [d] cannot execute concurrently with [b]. ([c] and [d] could run concurrently on a Fermi GPU with dual DMA engines.) My general recommendation is to issue most of the host-to-device copies at the start of the sequence to get the desired overlap. In this case I would order the operations as follows:

a) Stream[0] memcpy host to device
d) Stream[1] memcpy host to device
b) Stream[0] execute a kernel
e) Stream[1] execute a kernel
c) Stream[0] memcpy device to host
f) Stream[1] memcpy device to host

Now [d] can overlap with [b], and [e] with [c]. If these operations are performed in a loop (e.g. a simulation loop), [f] could also overlap with [a] on a Fermi with dual DMA engines.
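A sketch of that breadth-first issue order inside a loop (kernel, buffer names, sizes, and iteration count are all hypothetical placeholders; error checking omitted):

```cuda
#include <cuda_runtime.h>

// Trivial stand-in for the real kernel -- purely illustrative.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main(void) {
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);
    float *h_a, *h_b, *d_a, *d_b;

    // Page-locked host buffers, so the async copies can actually overlap.
    cudaHostAlloc(&h_a, bytes, cudaHostAllocDefault);
    cudaHostAlloc(&h_b, bytes, cudaHostAllocDefault);
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Breadth-first issue order: both H2D copies first, then both kernels,
    // then both D2H copies. Stream 1's copy [d] is queued before stream 0's
    // kernel [b] finishes issuing its successors, so [d] can overlap [b].
    for (int step = 0; step < 100; ++step) {                           // e.g. a simulation loop
        cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s0);  // a
        cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, s1);  // d
        scale<<<(N + 255) / 256, 256, 0, s0>>>(d_a, N);                // b
        scale<<<(N + 255) / 256, 256, 0, s1>>>(d_b, N);                // e
        cudaMemcpyAsync(h_a, d_a, bytes, cudaMemcpyDeviceToHost, s0);  // c
        cudaMemcpyAsync(h_b, d_b, bytes, cudaMemcpyDeviceToHost, s1);  // f
    }
    cudaDeviceSynchronize();

    cudaStreamDestroy(s0); cudaStreamDestroy(s1);
    cudaFreeHost(h_a); cudaFreeHost(h_b);
    cudaFree(d_a); cudaFree(d_b);
    return 0;
}
```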

OK, so despite having multiple "streams", there's essentially a single queue, and operations begin executing in the order they were queued? I had assumed each stream had its own queue.

Thanks so much for the explanation.