cudaStreamSynchronize(a_stream) simpleStreams


Can someone explain why there is no cudaStreamSynchronize in the “time execution with nstreams streams” part of the simpleStreams code?

I was expecting something like this:

for(int k = 0; k < nreps; k++)
{
        // asynchronously launch nstreams kernels, each operating on its own portion of data
        for(int i = 0; i < nstreams; i++)
            init_array<<<blocks, threads, 0, streams[i]>>>(d_a + i * n / nstreams, d_c, niterations);

        // asynchronously launch nstreams memcopies.  Note that memcopy in stream x will only
        //   commence executing when all previous CUDA calls in stream x have completed
        for(int i = 0; i < nstreams; i++)
            cudaMemcpyAsync(a + i * n / nstreams, d_a + i * n / nstreams, nbytes / nstreams, cudaMemcpyDeviceToHost, streams[i]);

        ///////// THIS PART MISSING ?????
        for(int i = 0; i < nstreams; i++)
            cudaStreamSynchronize(streams[i]);
}



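You don’t need it in this case. The stop event is recorded in stream 0 outside of the loop, and stream 0 (the legacy default stream) is synchronizing: work submitted to it does not start until all earlier work in the other streams has finished. The cudaEventSynchronize() call on that event therefore acts as a barrier, and the host waits there until the GPU has finished everything that was queued before it.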
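To make that concrete, here is a minimal sketch of the same pattern, not the actual sample code. The kernel fill, the sizes, and the value 5 are made up for illustration. It assumes the program is built with the legacy default stream (the nvcc default, i.e. not with --default-stream per-thread) and that the streams come from plain cudaStreamCreate. After cudaEventSynchronize returns, cudaStreamQuery is used to check whether each stream still has pending work.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the sample's init_array kernel: fills a slice with one value.
__global__ void fill(int *a, int value, int count)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < count) a[idx] = value;
}

int main()
{
    const int nstreams = 4;
    const int n = 1 << 20;                  // total elements, divisible by nstreams
    const int chunk = n / nstreams;
    const size_t nbytes = n * sizeof(int);

    int *a = nullptr, *d_a = nullptr;
    cudaMallocHost((void **)&a, nbytes);    // pinned host memory, needed for truly async copies
    cudaMalloc((void **)&d_a, nbytes);

    cudaStream_t streams[nstreams];
    for (int i = 0; i < nstreams; i++)
        cudaStreamCreate(&streams[i]);      // blocking streams: they order against legacy stream 0

    cudaEvent_t start_event, stop_event;
    cudaEventCreate(&start_event);
    cudaEventCreate(&stop_event);

    cudaEventRecord(start_event, 0);        // recorded in stream 0
    for (int i = 0; i < nstreams; i++)
        fill<<<(chunk + 255) / 256, 256, 0, streams[i]>>>(d_a + i * chunk, 5, chunk);
    for (int i = 0; i < nstreams; i++)
        cudaMemcpyAsync(a + i * chunk, d_a + i * chunk, nbytes / nstreams,
                        cudaMemcpyDeviceToHost, streams[i]);
    cudaEventRecord(stop_event, 0);         // stream 0 again: queued behind all work above
    cudaEventSynchronize(stop_event);       // host blocks here; no cudaStreamSynchronize loop

    // Check that no stream still has pending work and that the copied data is complete.
    bool idle = true, correct = true;
    for (int i = 0; i < nstreams; i++)
        if (cudaStreamQuery(streams[i]) != cudaSuccess) idle = false;
    for (int i = 0; i < n; i++)
        if (a[i] != 5) { correct = false; break; }

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start_event, stop_event);
    printf("streams idle: %s, data correct: %s, elapsed: %.3f ms\n",
           idle ? "yes" : "no", correct ? "yes" : "no", ms);

    for (int i = 0; i < nstreams; i++)
        cudaStreamDestroy(streams[i]);
    cudaEventDestroy(start_event);
    cudaEventDestroy(stop_event);
    cudaFreeHost(a);
    cudaFree(d_a);
    return (idle && correct) ? 0 : 1;
}
```

If the streams were created with cudaStreamCreateWithFlags(..., cudaStreamNonBlocking), or the code were compiled with per-thread default streams, stream 0 would no longer wait for them, and an explicit cudaStreamSynchronize per stream (or cudaDeviceSynchronize) would be needed after all.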


There is also the simpleMultiCopy example, which has some kind of synchronisation workaround in the processWithStream function.

Not easy to follow.