Weird behaviour of CUDA streams

Hi,

I’m trying to overlap memcpy and kernel execution using streams. If I do it as in the Programming Guide example - first issuing all host-to-device copies, then all kernel launches, then all device-to-host copies - it works:

[codebox]for(int i=0;i<stream_count;i++) {
    int begin=vector_size*i/stream_count;
    int end=vector_size*(i+1)/stream_count;
    int size=end-begin;
    // copy input vector to gpu //
    cudaMemcpyAsync(vector_input_gpu+begin,vector_input_cpu+begin,
                    size*sizeof(float),
                    cudaMemcpyHostToDevice,streams[i]);
}

for(int i=0;i<stream_count;i++) {
    int begin=vector_size*i/stream_count;
    int end=vector_size*(i+1)/stream_count;
    int size=end-begin;
    // cuda kernel call //
    compute<<<30,32,0,streams[i]>>>(vector_input_gpu+begin,
                                    vector_output_gpu+begin,
                                    size);
}

for(int i=0;i<stream_count;i++) {
    int begin=vector_size*i/stream_count;
    int end=vector_size*(i+1)/stream_count;
    int size=end-begin;
    // copy output vector from gpu to cpu //
    cudaMemcpyAsync(vector_output_cpu+begin,vector_output_gpu+begin,
                    size*sizeof(float),
                    cudaMemcpyDeviceToHost,streams[i]);
}[/codebox]

But I wonder why I cannot put everything in one loop - first issuing all operations of stream 0, then of stream 1, and so on. If I try this, it is as slow as the non-streamed version; it seems there is no overlap at all then:

[codebox]for(int i=0;i<stream_count;i++) {
    int begin=vector_size*i/stream_count;
    int end=vector_size*(i+1)/stream_count;
    int size=end-begin;
    // copy input vector to gpu //
    cudaMemcpyAsync(vector_input_gpu+begin,vector_input_cpu+begin,
                    size*sizeof(float),
                    cudaMemcpyHostToDevice,streams[i]);
    // cuda kernel call //
    compute<<<30,32,0,streams[i]>>>(vector_input_gpu+begin,
                                    vector_output_gpu+begin,
                                    size);
    // copy output vector from gpu to cpu //
    cudaMemcpyAsync(vector_output_cpu+begin,vector_output_gpu+begin,
                    size*sizeof(float),
                    cudaMemcpyDeviceToHost,streams[i]);
}[/codebox]

Can someone explain this to me?

If I combine just the in-copy and the kernel, or just the kernel and the out-copy, in one loop it works fine, but not if I combine all three. I find this really weird.
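
In case it matters, the streams and buffers are set up roughly like this (just a sketch - the sizes are placeholders, but the host buffers are allocated with cudaMallocHost, since cudaMemcpyAsync only behaves asynchronously with page-locked host memory):

[codebox]// setup sketch (values are placeholders): streams plus pinned host buffers
const int stream_count=4;
const int vector_size=1<<22;

cudaStream_t streams[stream_count];
for(int i=0;i<stream_count;i++)
    cudaStreamCreate(&streams[i]);

float *vector_input_cpu,*vector_output_cpu;
cudaMallocHost((void**)&vector_input_cpu,vector_size*sizeof(float));   // pinned host memory
cudaMallocHost((void**)&vector_output_cpu,vector_size*sizeof(float));  // pinned host memory

float *vector_input_gpu,*vector_output_gpu;
cudaMalloc((void**)&vector_input_gpu,vector_size*sizeof(float));
cudaMalloc((void**)&vector_output_gpu,vector_size*sizeof(float));[/codebox]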