Multiple streams.

for (int i = 0; i < 2; ++i) {
    // Copy this stream's chunk of input to the device
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
                    size, cudaMemcpyHostToDevice, stream[i]);
    // Process the chunk in the same stream
    MyKernel<<<100, 512, 0, stream[i]>>>
        (outputDevPtr + i * size, inputDevPtr + i * size, size);
    // Copy this stream's chunk of output back to the host
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
                    size, cudaMemcpyDeviceToHost, stream[i]);
}


I came across this code. Is it better than the single-stream case?
If so, why?
I wonder whether the multiple streams actually run concurrently.


By "one stream" I guess you mean executing everything on the default stream (stream 0). Compared with that, using multiple streams has several benefits:

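  1. Asynchronous memory copies: cudaMemcpyAsync returns control to the CPU right away. For the copy to be truly asynchronous, the host buffer must be page-locked (pinned), e.g. allocated with cudaMallocHost.

  2. Overlap of memory copies and kernel execution: while one stream is copying data, another stream can be running its kernel.

  3. Concurrent execution of kernels from different streams, on devices that support it and when resources allow.

  4. Because the calls issued from the CPU don't require synchronization, you can queue up work ahead of time and keep the GPU fully occupied.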

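Within each stream, operations execute in the order in which they were issued. Operations in different streams have no guaranteed order relative to each other, which is what allows them to overlap.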

One drawback of streams I can think of is the extra memory you need: each stream works on its own input and output buffers, all of which must be allocated at the same time.
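
To make the points above concrete, here is a minimal sketch of the two-stream pattern from the question with the surrounding setup filled in. The kernel scaleKernel, the function name runTwoStreamDemo, and the buffer sizes are assumptions made for illustration (the original MyKernel is not shown in the thread). It uses pinned host memory, gives each stream its own slice of the buffers, and synchronizes only once at the end. Error handling is deliberately minimal.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for MyKernel: doubles every input element.
__global__ void scaleKernel(float *out, const float *in, int n)
{
    // Grid-stride loop so the fixed <<<100, 512>>> launch covers all n elements.
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < n;
         idx += gridDim.x * blockDim.x)
        out[idx] = 2.0f * in[idx];
}

// Returns 0 if every element came back doubled, 1 otherwise.
// Note: on an early error return this sketch does not release resources.
int runTwoStreamDemo()
{
    const int nStreams = 2;
    const int n = 1 << 20;                   // elements per stream (assumed)
    const size_t bytes = n * sizeof(float);  // bytes per stream

    float *hostPtr = nullptr, *inputDevPtr = nullptr, *outputDevPtr = nullptr;
    cudaStream_t stream[nStreams];

    // Pinned host memory, so the async copies can overlap with other work.
    if (cudaMallocHost((void **)&hostPtr, nStreams * bytes) != cudaSuccess) return 1;
    // Each stream gets its own slice of the device buffers.
    if (cudaMalloc((void **)&inputDevPtr, nStreams * bytes) != cudaSuccess) return 1;
    if (cudaMalloc((void **)&outputDevPtr, nStreams * bytes) != cudaSuccess) return 1;
    for (int i = 0; i < nStreams; ++i)
        if (cudaStreamCreate(&stream[i]) != cudaSuccess) return 1;

    for (int i = 0; i < nStreams * n; ++i) hostPtr[i] = 1.0f;

    // Queue copy-in, kernel, copy-out per stream. None of these calls block
    // the CPU; within a stream they run in the order issued.
    for (int i = 0; i < nStreams; ++i) {
        const int offset = i * n;  // offset in elements, not bytes
        cudaMemcpyAsync(inputDevPtr + offset, hostPtr + offset, bytes,
                        cudaMemcpyHostToDevice, stream[i]);
        scaleKernel<<<100, 512, 0, stream[i]>>>(outputDevPtr + offset,
                                                inputDevPtr + offset, n);
        cudaMemcpyAsync(hostPtr + offset, outputDevPtr + offset, bytes,
                        cudaMemcpyDeviceToHost, stream[i]);
    }

    // Wait once for all streams to finish before reading the results.
    int bad = (cudaDeviceSynchronize() != cudaSuccess);
    for (int i = 0; !bad && i < nStreams * n; ++i)
        if (hostPtr[i] != 2.0f) bad = 1;

    for (int i = 0; i < nStreams; ++i) cudaStreamDestroy(stream[i]);
    cudaFree(inputDevPtr);
    cudaFree(outputDevPtr);
    cudaFreeHost(hostPtr);
    return bad;
}
```

You would call runTwoStreamDemo() from main and build with nvcc. Whether the copies and kernels of the two streams really overlap depends on the device, so it is worth checking the timeline in a profiler rather than assuming it.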