Stream execution order in CUDA exercise

I have a question about the final exercise of the Asynchronous Streaming section of the Fundamentals of Accelerated Computing with CUDA C/C++ course. The exercise is called “Overlap Kernel Execution and Memory Copy Back to Host”. Here is the relevant code from the exercise:

// Create three streams to initialize the three arrays
cudaStream_t stream1, stream2, stream3;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaStreamCreate(&stream3);

// Initialize the arrays a, b and c with some data
initWith<<<numberOfBlocks, threadsPerBlock, 0, stream1>>>(3, a, N);
initWith<<<numberOfBlocks, threadsPerBlock, 0, stream2>>>(4, b, N);
initWith<<<numberOfBlocks, threadsPerBlock, 0, stream3>>>(0, c, N);

// Execute addVectorsInto, which computes c = a + b, in 4 segments
// Can the streams in this loop execute addVectorsInto before a, b and c are initialized?
for (int i = 0; i < 4; ++i) {
	cudaStream_t stream;
	cudaStreamCreate(&stream);

	addVectorsInto<<<numberOfBlocks/4, threadsPerBlock, 0, stream>>>(&c[i*N/4], &a[i*N/4], &b[i*N/4], N/4);
	cudaMemcpyAsync(&h_c[i*N/4], &c[i*N/4], size/4, cudaMemcpyDeviceToHost, stream);
	cudaStreamDestroy(stream);
}

We create three streams to initialize the vectors a, b and c. Then we create four additional streams to perform the operation c = a + b. It is my understanding that CUDA does not guarantee the execution order of the streams, so the four streams that do c = a + b might run before the three streams that fill a, b and c with data. Is this understanding correct? Could the streams execute in parallel such that the vector c ends up with a mix of 0’s and 7’s?

Your understanding is correct. Nothing in this code makes the four streams in the for loop wait for the three initWith kernels to finish: CUDA only orders operations within a stream, not across streams, so work in different streams may run concurrently or in any order. A straightforward fix is to call cudaStreamSynchronize on each of the three initialization streams before entering the loop.
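A minimal sketch of that fix, reusing the kernel and variable names from the exercise (assuming the streams were created with cudaStreamCreate earlier). cudaDeviceSynchronize would also work, but it blocks on all outstanding GPU work rather than just these three streams:

```cuda
// Block the host until each initialization stream has drained,
// so a, b and c are fully written before any addVectorsInto launch.
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);
cudaStreamSynchronize(stream3);

// Now the per-segment streams cannot observe uninitialized data.
for (int i = 0; i < 4; ++i) {
	cudaStream_t stream;
	cudaStreamCreate(&stream);

	addVectorsInto<<<numberOfBlocks/4, threadsPerBlock, 0, stream>>>(&c[i*N/4], &a[i*N/4], &b[i*N/4], N/4);
	cudaMemcpyAsync(&h_c[i*N/4], &c[i*N/4], size/4, cudaMemcpyDeviceToHost, stream);
	cudaStreamDestroy(stream);
}
```

Note that the kernel launches and the cudaMemcpyAsync inside each iteration still overlap across the four streams, which is the point of the exercise; only the initialization is fenced off.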

On a V100 the initialization kernels may happen to finish before the loop's work gets scheduled, so you might never observe the problem there, but you might on an older, slower card. Either way the race is present and the code is incorrect without explicit synchronization.
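If you want to express the dependency without blocking the host at all, CUDA events can do it on the device side: record an event in each initialization stream, then have every worker stream wait on those events. This is a sketch under the same naming assumptions as the exercise:

```cuda
// Record a (timing-disabled, cheaper) event in each init stream
// right after its initWith launch.
cudaEvent_t aReady, bReady, cReady;
cudaEventCreateWithFlags(&aReady, cudaEventDisableTiming);
cudaEventCreateWithFlags(&bReady, cudaEventDisableTiming);
cudaEventCreateWithFlags(&cReady, cudaEventDisableTiming);
cudaEventRecord(aReady, stream1);
cudaEventRecord(bReady, stream2);
cudaEventRecord(cReady, stream3);

for (int i = 0; i < 4; ++i) {
	cudaStream_t stream;
	cudaStreamCreate(&stream);

	// Make this stream wait (on the GPU, not the host) until all
	// three initialization kernels have completed.
	cudaStreamWaitEvent(stream, aReady, 0);
	cudaStreamWaitEvent(stream, bReady, 0);
	cudaStreamWaitEvent(stream, cReady, 0);

	addVectorsInto<<<numberOfBlocks/4, threadsPerBlock, 0, stream>>>(&c[i*N/4], &a[i*N/4], &b[i*N/4], N/4);
	cudaMemcpyAsync(&h_c[i*N/4], &c[i*N/4], size/4, cudaMemcpyDeviceToHost, stream);
	cudaStreamDestroy(stream);
}
```

The host thread never stalls here; the ordering constraint lives entirely in the GPU's work queues, which is useful if the host has other work to enqueue in the meantime.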