Stream execution order in CUDA exercise

anders.floderus · August 13, 2019, 3:34pm

I have a question on the final exercise of the Asynchronous Streaming part of the Fundamentals of Accelerated Computing with CUDA C/C++ course. The exercise is called “Overlap Kernel Execution and Memory Copy Back to Host”. Here is the relevant code from 01-overlap-xfer-solution.cu:

//Create three streams to initialize three arrays
cudaStream_t stream1, stream2, stream3;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaStreamCreate(&stream3);

//Initialize the arrays a, b and c with some data
initWith<<<numberOfBlocks, threadsPerBlock, 0, stream1>>>(3, a, N);
initWith<<<numberOfBlocks, threadsPerBlock, 0, stream2>>>(4, b, N);
initWith<<<numberOfBlocks, threadsPerBlock, 0, stream3>>>(0, c, N);

//Execute addVectorsInto, which computes c = a + b, in 4 segments
//Can the streams in this loop execute addVectorsInto before a, b and c are initialized?
for (int i = 0; i < 4; ++i) {
	cudaStream_t stream;
	cudaStreamCreate(&stream);

	addVectorsInto<<<numberOfBlocks/4, threadsPerBlock, 0, stream>>>(&c[i*N/4], &a[i*N/4], &b[i*N/4], N/4);
	cudaMemcpyAsync(&h_c[i*N/4], &c[i*N/4], size/4, cudaMemcpyDeviceToHost, stream);
	cudaStreamDestroy(stream);
}

We create three streams to initialize the vectors a, b and c. Then we create four additional streams to perform the operation c = a + b. It is my understanding that CUDA does not guarantee the execution order of the streams, so the four streams that do c = a + b might run before the three streams that fill a, b and c with data. Is this understanding correct? Could the streams execute in parallel such that the vector c ends up with a mix of 0’s and 7’s?

mnicely · February 3, 2020, 6:24pm

Your assumption is correct. There is no mechanism that forces the 4 streams in the for loop to execute after the 3 initWith kernels have finished. It would be smart to call cudaStreamSynchronize on each of the 3 streams running those kernels before entering the loop.

On a V100, you might not see an issue, but you might on an older, slower card.
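
For illustration, here is a minimal sketch of the synchronization suggested above. The thread only shows the kernel launches, so the bodies of initWith and addVectorsInto, the values of N, threadsPerBlock and numberOfBlocks, and the allocation calls are all assumptions made to keep the example self-contained. They are not taken from the course solution.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumed kernel body (not shown in the thread): grid-stride fill.
__global__ void initWith(float num, float *a, int N)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < N; i += stride)
        a[i] = num;
}

// Assumed kernel body (not shown in the thread): result = a + b.
__global__ void addVectorsInto(float *result, const float *a, const float *b, int N)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < N; i += stride)
        result[i] = a[i] + b[i];
}

int main()
{
    const int N = 1 << 22;            // assumed size, divisible by 4
    const size_t size = N * sizeof(float);
    const int threadsPerBlock = 256;  // assumed launch configuration
    const int numberOfBlocks = 128;   // assumed, divisible by 4

    float *a, *b, *c, *h_c;
    cudaMalloc(&a, size);
    cudaMalloc(&b, size);
    cudaMalloc(&c, size);
    cudaMallocHost(&h_c, size);       // pinned host memory for async copies

    cudaStream_t initStreams[3];
    for (int i = 0; i < 3; ++i)
        cudaStreamCreate(&initStreams[i]);

    initWith<<<numberOfBlocks, threadsPerBlock, 0, initStreams[0]>>>(3, a, N);
    initWith<<<numberOfBlocks, threadsPerBlock, 0, initStreams[1]>>>(4, b, N);
    initWith<<<numberOfBlocks, threadsPerBlock, 0, initStreams[2]>>>(0, c, N);

    // Block the host until all three initializations have completed,
    // so no addVectorsInto launch can read a or b too early.
    for (int i = 0; i < 3; ++i)
        cudaStreamSynchronize(initStreams[i]);

    cudaStream_t streams[4];
    for (int i = 0; i < 4; ++i) {
        cudaStreamCreate(&streams[i]);
        const int offset = i * (N / 4);
        addVectorsInto<<<numberOfBlocks / 4, threadsPerBlock, 0, streams[i]>>>(
            &c[offset], &a[offset], &b[offset], N / 4);
        cudaMemcpyAsync(&h_c[offset], &c[offset], size / 4,
                        cudaMemcpyDeviceToHost, streams[i]);
    }

    // Wait for all segment kernels and copies before reading h_c.
    cudaDeviceSynchronize();

    int errors = 0;
    for (int i = 0; i < N; ++i)
        if (h_c[i] != 7.0f) ++errors;
    printf("%s\n", errors == 0 ? "all elements are 7" : "mismatch found");

    for (int i = 0; i < 3; ++i) cudaStreamDestroy(initStreams[i]);
    for (int i = 0; i < 4; ++i) cudaStreamDestroy(streams[i]);
    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    cudaFreeHost(h_c);
    return errors == 0 ? 0 : 1;
}
```

The three cudaStreamSynchronize calls stall the host thread until initialization is done. If that stall matters, recording an event in each initialization stream with cudaEventRecord and making each loop stream wait on it with cudaStreamWaitEvent should express the same ordering without blocking the host.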
