CUDA streams


I am trying to understand how streams are written in CUDA.

1- Basically I am looking for an example that shows this. Also, I found some code as follows:

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaMemcpyAsync(dst, src, size, dir, stream1);
kernel<<<grid, block, 0, stream2>>>(…);

But I am not able to understand it. In the above, are stream1 and stream2 the kernels?

2- I understand that a stream is a sequence of operations that execute in order on the GPU. Streams can also be useful because of their ability to concurrently execute a kernel and a memcpy. Suppose I have to do three operations O1, O2, O3 on a chunk of data, one after another (in sequence). How shall I proceed? Shall I write three different kernels? Pseudo-code here would be helpful for me to understand the concept.

Thanks for your time,


No, they are streams. Streams can be thought of as command pipelines. You can have several open to the same device at once and push asynchronous commands down different streams. On GPUs with concurrent copy/execution capability, the driver will work out when commands on different streams can be overlapped and execute them accordingly. In your example code, one stream is being used for an asynchronous copy, and the other to run a kernel. Both can run at the same time to improve computational efficiency and hide PCI-e latency.
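To make the snippet from the question concrete, here is a hedged sketch of the same pattern; the buffer names (`h_a`, `d_a`, `d_b`), the kernel, and `N` are all illustrative, not from the original code:

```cuda
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // placeholder work
}

int main(void)
{
    const int N = 1 << 20;
    float *h_a, *d_a, *d_b;

    // Async copies need page-locked (pinned) host memory to actually overlap.
    cudaMallocHost(&h_a, N * sizeof(float));
    cudaMalloc(&d_a, N * sizeof(float));
    cudaMalloc(&d_b, N * sizeof(float));

    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    // The copy is issued in stream1 while the kernel runs on different data
    // in stream2; on devices with a copy engine the driver can overlap them.
    cudaMemcpyAsync(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice, stream1);
    myKernel<<<(N + 255) / 256, 256, 0, stream2>>>(d_b, N);

    cudaStreamSynchronize(stream1);
    cudaStreamSynchronize(stream2);

    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);
    cudaFree(d_a); cudaFree(d_b); cudaFreeHost(h_a);
    return 0;
}
```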

That would be the usual approach. If at the end of the O1-O2-O3 sequence your host code needed to compute something based on the intermediate results of O2, then it would make sense to use streams and do an asynchronous copy back to the host while O3 was still running. If the calculation needs the result of O3 on the host, then you have no choice but to wait until O3 is finished, and streams probably wouldn't be of any benefit.
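A sketch of that O1-O2-O3 case, with everything here (the kernel bodies, `d_data`, `d_mid`, `h_mid`, `N`) hypothetical: three kernels launched into the same stream run strictly in order, and an event lets a second stream copy O2's intermediate result back while O3 is still running.

```cuda
#include <cuda_runtime.h>

// Placeholder kernels for the three operations (bodies are illustrative).
__global__ void O1(float *d, int n)
{ int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) d[i] += 1.0f; }
__global__ void O2(const float *in, float *mid, int n)
{ int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) mid[i] = in[i] * 2.0f; }
__global__ void O3(const float *mid, float *out, int n)
{ int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) out[i] = mid[i] - 1.0f; }

int main(void)
{
    const int N = 1 << 20;
    const int threads = 256, blocks = (N + threads - 1) / threads;
    float *d_data, *d_mid, *d_out, *h_mid;
    cudaMalloc(&d_data, N * sizeof(float));
    cudaMalloc(&d_mid,  N * sizeof(float));
    cudaMalloc(&d_out,  N * sizeof(float));
    cudaMallocHost(&h_mid, N * sizeof(float));  // pinned, so the copy can overlap

    cudaStream_t compute, copy;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);
    cudaEvent_t o2_done;
    cudaEventCreate(&o2_done);

    // Same stream => O1, O2, O3 execute strictly in order.
    O1<<<blocks, threads, 0, compute>>>(d_data, N);
    O2<<<blocks, threads, 0, compute>>>(d_data, d_mid, N);
    cudaEventRecord(o2_done, compute);
    O3<<<blocks, threads, 0, compute>>>(d_mid, d_out, N);

    // The copy stream waits only for O2, so this device-to-host transfer of
    // the intermediate result can overlap with O3 (both only read d_mid).
    cudaStreamWaitEvent(copy, o2_done, 0);
    cudaMemcpyAsync(h_mid, d_mid, N * sizeof(float),
                    cudaMemcpyDeviceToHost, copy);
    cudaStreamSynchronize(copy);     // h_mid is now valid on the host
    cudaStreamSynchronize(compute);

    cudaEventDestroy(o2_done);
    cudaStreamDestroy(compute); cudaStreamDestroy(copy);
    cudaFree(d_data); cudaFree(d_mid); cudaFree(d_out); cudaFreeHost(h_mid);
    return 0;
}
```

Note that there is no need for three streams just because there are three operations: work that must run in sequence belongs in one stream, and extra streams are only worthwhile for work that can genuinely overlap.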

The principles are discussed in section 3.2.6 of the programming guide.

Even in the normal case, when there are no streams, there exists a default stream.

Stream basically means that ALL operations initiated are served on a FIFO basis.

Thus a code sequence like:


cudaMemcpy(TO_GPU)                 ---------------- REF_1

kernel1 <<< >>>

cudaMemcpy(FROM_GPU)               ---------------- REF_2

cudaMemcpy(TO_GPU_FOR_NEXT_KERNEL) ---------------- REF_3

kernel2 <<< >>>

will execute in FIFO manner…

So, the REF_3 cudaMemcpy will have to wait for all previous operations (including REF_1, REF_2) to complete… This is normal…

But there are some cards out there which can support concurrent kernel execution and memcpy… For such cards, “kernel1” can execute and at the same time memcpy in “REF_3” can execute…

So, There needs to be a way to express this parallelism without disturbing older semantics…

And thus, CUDA streams were born… HTH
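To express that parallelism, the REF_1…kernel2 sequence above could be split over two streams, roughly as follows (a sketch only — the kernels, buffers, and sizes are all made up; pinned host memory is assumed because cudaMemcpyAsync needs it to overlap):

```cuda
#include <cuda_runtime.h>

// Illustrative kernels standing in for "kernel1" and "kernel2".
__global__ void kernel1(float *d, int n)
{ int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) d[i] *= 2.0f; }
__global__ void kernel2(float *d, int n)
{ int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) d[i] += 1.0f; }

int main(void)
{
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);
    float *h_a, *h_b, *d_a, *d_b;
    cudaMallocHost(&h_a, bytes);   // pinned host memory: required for
    cudaMallocHost(&h_b, bytes);   // cudaMemcpyAsync to actually overlap
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Work for kernel1 goes down stream s1 (REF_1, kernel1, REF_2)...
    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s1);   // REF_1
    kernel1<<<(N + 255) / 256, 256, 0, s1>>>(d_a, N);
    cudaMemcpyAsync(h_a, d_a, bytes, cudaMemcpyDeviceToHost, s1);   // REF_2

    // ...while REF_3 goes down s2, so it no longer queues behind them and
    // can run while kernel1 executes, on cards that support the overlap.
    cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, s2);   // REF_3
    kernel2<<<(N + 255) / 256, 256, 0, s2>>>(d_b, N);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);

    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaFree(d_a); cudaFree(d_b); cudaFreeHost(h_a); cudaFreeHost(h_b);
    return 0;
}
```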


If I put a cudaThreadSynchronize() after the kernel and use cudaMemcpyAsync() in REF_2, will this GPU-to-CPU transfer wait for the kernel, or can it still do an asynchronous memcpy?