CUDA stream

What would I benefit from in the following code?

cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);

float* hostPtr;
cudaMallocHost(&hostPtr, 2 * size);  // async copies need pinned host memory

for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size, size,
                    cudaMemcpyHostToDevice, stream[i]);

for (int i = 0; i < 2; ++i)
    myKernel<<<1000, 512, 0, stream[i]>>>(outputDevPtr + i * size, inputDevPtr + i * size, size);

for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size, size,
                    cudaMemcpyDeviceToHost, stream[i]);


If only one stream were used for this code, would the performance be worse than with two streams?

What I think is that, after all, these streams have to share the hardware?

The advantage is that with N streams you can hide a large fraction of the cost of copying data to and from the device (up to (N-1)/N of it), because the calls are asynchronous and copies overlap with kernel execution. If you plot the actions occurring on your device on a timeline, you'd have something like:

[copy0HtoD | kernel0/copy1HtoD | kernel1/copy0DtoH | copy1DtoH]

where [name]/[name] denotes concurrent execution. With one stream the same work would serialize into six slots instead of four.

(This might not apply to your case, since it depends on whether your application is memory-bound or compute-bound, but for the example above assume memory copies and kernel execution take the same amount of time.)

They do share the hardware, but between kernel launch and the return to the host, data can be copied to and from the device without penalty (I'm not quite sure the penalty is absolutely zero).
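The one-stream-vs-two question is easy to answer empirically by timing both variants with CUDA events. Below is a minimal sketch; the kernel body, chunk size `n`, and all variable names are placeholders I've assumed, not taken from the original post:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for myKernel from the snippet above.
__global__ void myKernel(float* out, const float* in, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = in[idx] * 2.0f;
}

int main() {
    const int n = 1 << 20;                  // elements per chunk (assumed)
    const size_t size = n * sizeof(float);  // bytes per chunk

    float *hostPtr, *inputDevPtr, *outputDevPtr;
    cudaMallocHost(&hostPtr, 2 * size);     // pinned memory: required for true async copies
    cudaMalloc(&inputDevPtr, 2 * size);
    cudaMalloc(&outputDevPtr, 2 * size);

    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i)
        cudaStreamCreate(&stream[i]);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Two-stream version: copies in one stream overlap the kernel in the other.
    cudaEventRecord(start);
    for (int i = 0; i < 2; ++i) {
        cudaMemcpyAsync(inputDevPtr + i * n, hostPtr + i * n, size,
                        cudaMemcpyHostToDevice, stream[i]);
        myKernel<<<(n + 511) / 512, 512, 0, stream[i]>>>(
            outputDevPtr + i * n, inputDevPtr + i * n, n);
        cudaMemcpyAsync(hostPtr + i * n, outputDevPtr + i * n, size,
                        cudaMemcpyDeviceToHost, stream[i]);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("two streams: %.3f ms\n", ms);

    // For comparison, issue every operation into stream[0] only:
    // the same copy/kernel/copy sequence then serializes, and the
    // measured time should approach the sum of all six operations.
    return 0;
}
```

Whether two streams actually beat one depends on the GPU having a copy engine that can run alongside the compute engine; profiling with the events above (or with Nsight Systems) shows whether the overlap really happens on your device.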