CUDA stream

What would I benefit from in the following code?

cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);

float* hostPtr;
cudaMallocHost(&hostPtr, 2 * size);  // async copies need pinned host memory

for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size, size,
                    cudaMemcpyHostToDevice, stream[i]);

for (int i = 0; i < 2; ++i)
    myKernel<<<1000, 512, 0, stream[i]>>>(outputDevPtr + i * size, inputDevPtr + i * size, size);

for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size, size,
                    cudaMemcpyDeviceToHost, stream[i]);


If only one stream were used for this code, would the performance be worse than with two streams?

What I think is that, after all, these streams have to share the hardware?

The advantage is that with N streams you can hide a large fraction of the cost of copying data to and from the device (up to (N-1)/N of it), because the calls are asynchronous and copies overlap with kernel execution. If you plot the actions occurring on your device on a timeline, you'd have something like:

[copy0HtoD | kernel0/copy1HtoD | kernel1/copy0DtoH | copy1DtoH]

where [name]/[name] denotes concurrent execution. With one stream the same work would serialize into six slots instead of four.

(This might not apply to your case, since it depends on whether your application is memory-bound or compute-bound, but for the example above assume memory copies and kernel execution take the same amount of time.)

They do share the hardware, but between kernel launch and the return to the host, data can be copied to and from the device without penalty (I'm not quite sure the penalty is absolutely zero).
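The one-stream-vs-two question is easy to answer empirically by timing both variants with CUDA events. Below is a minimal sketch; the kernel body, chunk size `n`, and all variable names are placeholders I've assumed, not taken from the original post:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for myKernel from the snippet above.
__global__ void myKernel(float* out, const float* in, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = in[idx] * 2.0f;
}

int main() {
    const int n = 1 << 20;                  // elements per chunk (assumed)
    const size_t size = n * sizeof(float);  // bytes per chunk

    float *hostPtr, *inputDevPtr, *outputDevPtr;
    cudaMallocHost(&hostPtr, 2 * size);     // pinned memory: required for true async copies
    cudaMalloc(&inputDevPtr, 2 * size);
    cudaMalloc(&outputDevPtr, 2 * size);

    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i)
        cudaStreamCreate(&stream[i]);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Two-stream version: copies in one stream overlap the kernel in the other.
    cudaEventRecord(start);
    for (int i = 0; i < 2; ++i) {
        cudaMemcpyAsync(inputDevPtr + i * n, hostPtr + i * n, size,
                        cudaMemcpyHostToDevice, stream[i]);
        myKernel<<<(n + 511) / 512, 512, 0, stream[i]>>>(
            outputDevPtr + i * n, inputDevPtr + i * n, n);
        cudaMemcpyAsync(hostPtr + i * n, outputDevPtr + i * n, size,
                        cudaMemcpyDeviceToHost, stream[i]);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("two streams: %.3f ms\n", ms);

    // For comparison, issue every operation into stream[0] only:
    // the same copy/kernel/copy sequence then serializes, and the
    // measured time should approach the sum of all six operations.
    return 0;
}
```

Whether two streams actually beat one depends on the GPU having a copy engine that can run alongside the compute engine; profiling with the events above (or with Nsight Systems) shows whether the overlap really happens on your device.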