Using streams for async memory operations: is it worth splitting a kernel launch into several streams?


I’ve never tried using streams to overlap memory operations with kernel execution…

I wonder: if I have N blocks (where N is considerably large, 30-40 thousand) and each block

uses M bytes of global memory (where M is, say, 1-2 KB),

would I get any performance benefit from splitting my kernel launch into K streams?

i.e., something like:

for i = 0 to K-1

  memcpyAsync(stream[i], HostToDev)

grid = dim3(N / K)

for i = 0 to K-1 // launch K kernels asynchronously

  kernel<<< grid, threads, 0, stream[i] >>>();

for i = 0 to K-1

  memcpyAsync(stream[i], DevToHost)

(Note the launch-configuration order: it's <<< grid, threads >>>, not the other way around.)
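The pseudocode above can be fleshed out into a minimal sketch. Everything here beyond the pattern itself is an assumption: the `scale` kernel, K = 4, and the buffer sizes are made up for illustration. The key requirements are pinned host memory (cudaMemcpyAsync needs it to actually be asynchronous) and giving each stream its own contiguous chunk so copies and kernels from different streams can overlap on hardware that supports it.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel: doubles each element of its chunk.
__global__ void scale(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] *= 2.0f;
}

int main()
{
    const int K = 4;               // number of streams (assumption)
    const int N = 1 << 20;         // total elements (assumption)
    const int chunk = N / K;       // elements handled per stream
    const size_t chunkBytes = chunk * sizeof(float);

    // Async copies require page-locked (pinned) host memory.
    float *h_data;
    cudaHostAlloc(&h_data, N * sizeof(float), cudaHostAllocDefault);
    for (int i = 0; i < N; ++i) h_data[i] = 1.0f;

    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));

    cudaStream_t stream[K];
    for (int i = 0; i < K; ++i) cudaStreamCreate(&stream[i]);

    const int threads = 256;
    const int blocks = (chunk + threads - 1) / threads;

    // Issue copy-in, kernel, copy-out per stream; work in different
    // streams may overlap, work within one stream stays ordered.
    for (int i = 0; i < K; ++i) {
        int off = i * chunk;
        cudaMemcpyAsync(d_data + off, h_data + off, chunkBytes,
                        cudaMemcpyHostToDevice, stream[i]);
        scale<<<blocks, threads, 0, stream[i]>>>(d_data + off, chunk);
        cudaMemcpyAsync(h_data + off, d_data + off, chunkBytes,
                        cudaMemcpyDeviceToHost, stream[i]);
    }
    cudaDeviceSynchronize();  // wait for all streams to finish

    printf("h_data[0] = %f\n", h_data[0]);

    for (int i = 0; i < K; ++i) cudaStreamDestroy(stream[i]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```

Whether this actually beats a single launch depends on whether the GPU can overlap copies with compute, and on the copy-to-compute time ratio; profiling both variants is the only reliable answer.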


After playing around with some of the new features in CUDA 2.2 (like cudaHostAlloc with the cudaHostAllocMapped flag), I've found that using this new feature gives better overall performance than manually programming the asynchronous copying or doing everything in one sweep.

Then again this is just one case among many.
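For comparison, here is a hedged sketch of the mapped (zero-copy) approach mentioned above, reusing the same hypothetical `scale` kernel. The buffer is allocated with cudaHostAllocMapped so the kernel reads and writes host memory directly over the bus, with no explicit cudaMemcpy at all; note that cudaSetDeviceFlags(cudaDeviceMapHost) must be called before the CUDA context is created.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Same hypothetical kernel as before: doubles each element.
__global__ void scale(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] *= 2.0f;
}

int main()
{
    const int N = 1 << 20;  // element count (assumption)

    // Must be set before any CUDA context exists.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Pinned host buffer that is also mapped into device address space.
    float *h_data;
    cudaHostAlloc(&h_data, N * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < N; ++i) h_data[i] = 1.0f;

    // Device-side alias of the same memory: no explicit copies needed.
    float *d_ptr;
    cudaHostGetDevicePointer(&d_ptr, h_data, 0);

    const int threads = 256;
    scale<<<(N + threads - 1) / threads, threads>>>(d_ptr, N);
    cudaDeviceSynchronize();  // results are visible in h_data after this

    printf("h_data[0] = %f\n", h_data[0]);

    cudaFreeHost(h_data);
    return 0;
}
```

Zero-copy tends to win when each byte is touched only once (the copy is effectively fused with the kernel); if the kernel reuses data, an explicit copy to device memory usually pays off instead.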

Ah, thanks. I hadn't noticed that the new cudaHostAlloc is not the same as cudaMallocHost.

I want to evaluate both approaches on my kernel and will then post the results.