Stream execution order in CUDA exercise

I have a question on the final exercise of the Asynchronous Streaming part of the Fundamentals of Accelerated Computing with CUDA C/C++ course. The exercise is called “Overlap Kernel Execution and Memory Copy Back to Host”. Here is the relevant code from 01-overlap-xfer-solution.cu:

//Create three streams to initialize three arrays
cudaStream_t stream1, stream2, stream3;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaStreamCreate(&stream3);

//Initialize the arrays a, b and c with some data
initWith<<<numberOfBlocks, threadsPerBlock, 0, stream1>>>(3, a, N);
initWith<<<numberOfBlocks, threadsPerBlock, 0, stream2>>>(4, b, N);
initWith<<<numberOfBlocks, threadsPerBlock, 0, stream3>>>(0, c, N);

//Execute addVectorsInto, which computes c = a + b, in 4 segments
//Can the streams in this loop execute addVectorsInto before a, b and c are initialized?
for (int i = 0; i < 4; ++i) {
	cudaStream_t stream;
	cudaStreamCreate(&stream);

	addVectorsInto<<<numberOfBlocks/4, threadsPerBlock, 0, stream>>>(&c[i*N/4], &a[i*N/4], &b[i*N/4], N/4);
	cudaMemcpyAsync(&h_c[i*N/4], &c[i*N/4], size/4, cudaMemcpyDeviceToHost, stream);
	cudaStreamDestroy(stream);
}

We create three streams to initialize the vectors a, b and c. Then we create four additional streams to perform the operation c = a + b. It is my understanding that CUDA does not guarantee the execution order of the streams, so the four streams that do c = a + b might run before the three streams that fill a, b and c with data. Is this understanding correct? Could the streams execute in parallel such that the vector c ends up with a mix of 0’s and 7’s?

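Your assumption is correct. Nothing forces the four streams created in the for loop to wait for the three initWith kernels: work issued to different non-default streams has no guaranteed order and may run concurrently unless you synchronize explicitly. It would be wise to call cudaStreamSynchronize on each of the three initialization streams (stream1, stream2 and stream3) before entering the loop.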
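As a rough sketch of that fix (not the course's official solution), the program below synchronizes the three initialization streams before launching the segmented addition. The kernel bodies, N, the launch configuration, the use of cudaMalloc/cudaMallocHost, and the helper name runWithInitSync are all assumptions made for illustration, since the post only shows the launch code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the course kernel (body assumed): fill a[0..N) with num.
__global__ void initWith(float num, float *a, int N) {
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < N; i += stride) a[i] = num;
}

// Stand-in for the course kernel (body assumed): result = a + b.
__global__ void addVectorsInto(float *result, const float *a, const float *b, int N) {
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < N; i += stride) result[i] = a[i] + b[i];
}

// Hypothetical helper: returns how many elements of h_c differ from 7,
// or -1 if a CUDA call reported an error.
int runWithInitSync() {
  const int N = 1 << 20;            // assumed size, divisible by 4
  const size_t size = N * sizeof(float);
  const int threadsPerBlock = 256;  // assumed launch configuration
  const int numberOfBlocks = 64;

  float *a = nullptr, *b = nullptr, *c = nullptr, *h_c = nullptr;
  cudaMalloc(&a, size);
  cudaMalloc(&b, size);
  cudaMalloc(&c, size);
  cudaMallocHost(&h_c, size);  // pinned host memory, so the copy can be asynchronous

  cudaStream_t stream1, stream2, stream3;
  cudaStreamCreate(&stream1);
  cudaStreamCreate(&stream2);
  cudaStreamCreate(&stream3);

  initWith<<<numberOfBlocks, threadsPerBlock, 0, stream1>>>(3, a, N);
  initWith<<<numberOfBlocks, threadsPerBlock, 0, stream2>>>(4, b, N);
  initWith<<<numberOfBlocks, threadsPerBlock, 0, stream3>>>(0, c, N);

  // The fix: block the host until each initialization stream has finished,
  // so no addVectorsInto launch below can start before a, b and c are filled.
  cudaStreamSynchronize(stream1);
  cudaStreamSynchronize(stream2);
  cudaStreamSynchronize(stream3);

  for (int i = 0; i < 4; ++i) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    addVectorsInto<<<numberOfBlocks / 4, threadsPerBlock, 0, stream>>>(
        &c[i * N / 4], &a[i * N / 4], &b[i * N / 4], N / 4);
    cudaMemcpyAsync(&h_c[i * N / 4], &c[i * N / 4], size / 4,
                    cudaMemcpyDeviceToHost, stream);
    // Returns immediately; the stream is released once its queued work completes.
    cudaStreamDestroy(stream);
  }

  // Wait for all kernels and copies before reading h_c on the host.
  cudaError_t err = cudaDeviceSynchronize();

  int mismatches = 0;
  if (err == cudaSuccess) {
    for (int i = 0; i < N; ++i)
      if (h_c[i] != 7.0f) ++mismatches;
  } else {
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    mismatches = -1;
  }

  cudaStreamDestroy(stream1);
  cudaStreamDestroy(stream2);
  cudaStreamDestroy(stream3);
  cudaFree(a);
  cudaFree(b);
  cudaFree(c);
  cudaFreeHost(h_c);
  return mismatches;
}

int main() {
  int mismatches = runWithInitSync();
  printf("mismatches: %d\n", mismatches);
  return mismatches == 0 ? 0 : 1;
}
```

With the three cudaStreamSynchronize calls in place, every element of h_c should be 7. The cost is that the host blocks until initialization is done; if that matters, recording an event in each initialization stream and calling cudaStreamWaitEvent in the worker streams would be another way to express the same dependency without stalling the host.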

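On a V100 you might never see a problem, presumably because the initWith kernels are launched first and tend to finish before the later work starts. That is a matter of timing rather than a guarantee, though, and an older, slower card may well expose the race.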