Help with CUDA streams

Hi,
I am new to using CUDA streams. I searched the forum but did not find a suitable answer.
Are the following two sections of code equivalent or different, and if different, why? When I run them, they give different results.

Code 1:
for(int i = 0; i < num_streams; i++)
    cudaMemcpyAsync(d_a + i * n / num_streams, a + i * n / num_streams, nbytes / num_streams, cudaMemcpyHostToDevice, gpu_streams[i]);

for(int i = 0; i < num_streams; i++)
    init_array<<<blocks, threads, 0, gpu_streams[i]>>>(d_a + i * n / num_streams, d_c, 1);

for(int i = 0; i < num_streams; i++)
    cudaMemcpyAsync(a + i * n / num_streams, d_a + i * n / num_streams, nbytes / num_streams, cudaMemcpyDeviceToHost, gpu_streams[i]);

Code 2:
for(int i = 0; i < num_streams; i++){
    cudaMemcpyAsync(d_a + i * n / num_streams, a + i * n / num_streams, nbytes / num_streams, cudaMemcpyHostToDevice, gpu_streams[i]);
    init_array<<<blocks, threads, 0, gpu_streams[i]>>>(d_a + i * n / num_streams, d_c, 1);
    cudaMemcpyAsync(a + i * n / num_streams, d_a + i * n / num_streams, nbytes / num_streams, cudaMemcpyDeviceToHost, gpu_streams[i]);
}

Thanks.

They will probably get scheduled somewhat differently in terms of overlapping copies with kernel execution (in my experience the first version tends to get better overlap).

My guess is that a copy and a kernel overlap somewhere, so the results depend on the exact scheduling, i.e. on what happens to be in memory at each point (impossible to tell from what you have posted). That is, in the first version the first kernel has a better chance of already being underway when the second memcpy is issued, and so on.
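Note that within a single stream the copy-kernel-copy chain is ordered either way, so per-chunk results should be identical in both versions *if* the host waits before reading them back. A common cause of "different results" is reading `a` while the device-to-host copies are still in flight, or using pageable host memory so `cudaMemcpyAsync` falls back to synchronous behavior. Below is a minimal sketch of the depth-first version (your Code 2) with those two points fixed. It assumes `float` data, `n` divisible by `num_streams`, and a simplified `init_array` (your real kernel takes a `d_c` argument that I have omitted), so it is an illustration of the pattern rather than your exact program:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the original init_array (the real d_c parameter is omitted here).
__global__ void init_array(float *a, int chunk)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < chunk)
        a[idx] = 1.0f;
}

int main()
{
    const int num_streams = 4;
    const int n = 1 << 20;                 // assumed divisible by num_streams
    const int chunk = n / num_streams;
    const size_t nbytes = n * sizeof(float);

    float *a, *d_a;
    cudaMallocHost(&a, nbytes);            // pinned host memory: required for
                                           // cudaMemcpyAsync to actually be async
    cudaMalloc(&d_a, nbytes);

    cudaStream_t gpu_streams[num_streams];
    for (int i = 0; i < num_streams; i++)
        cudaStreamCreate(&gpu_streams[i]);

    dim3 threads(256);
    dim3 blocks((chunk + threads.x - 1) / threads.x);

    // Depth-first issue order: each stream's copy -> kernel -> copy chain
    // is serialized within that stream, so the per-chunk result is well
    // defined regardless of how the streams interleave with each other.
    for (int i = 0; i < num_streams; i++) {
        float *dst = d_a + i * chunk;
        float *src = a + i * chunk;
        cudaMemcpyAsync(dst, src, nbytes / num_streams,
                        cudaMemcpyHostToDevice, gpu_streams[i]);
        init_array<<<blocks, threads, 0, gpu_streams[i]>>>(dst, chunk);
        cudaMemcpyAsync(src, dst, nbytes / num_streams,
                        cudaMemcpyDeviceToHost, gpu_streams[i]);
    }

    // Without this barrier the host may read a[] while the device-to-host
    // copies are still in flight, and the results look nondeterministic.
    cudaDeviceSynchronize();

    printf("a[0] = %f, a[n-1] = %f\n", a[0], a[n - 1]);

    for (int i = 0; i < num_streams; i++)
        cudaStreamDestroy(gpu_streams[i]);
    cudaFreeHost(a);
    cudaFree(d_a);
    return 0;
}
```

If both versions are synchronized like this and still disagree, the next thing I would check is whether `init_array`'s grid covers exactly one chunk, or whether some launch writes past its own chunk into a neighbor's region.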