I’ve never tried using streams to overlap memory operations with kernel execution…
I wonder: if I have N blocks (where N is considerably large, say 30–40 thousand) and each block
uses M bytes of global memory (where M is, say, 1–2 KB),
would I get any performance benefit from splitting my kernel launch across K streams?
i.e., something like:

    grid = dim3(N / K)
    for i = 0 to K-1
        memcpyAsync(stream[i], HostToDev)
    for i = 0 to K-1  // launch K kernels asynchronously
        kernel<<< grid, threads, 0, stream[i] >>>()
    for i = 0 to K-1
        memcpyAsync(stream[i], DevToHost)
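Fleshing that pseudocode out, a minimal sketch might look like the following (kernel, h_in/h_out, d_in/d_out, totalBytes, and threads are placeholders, not anything from a real API; it assumes the host buffers are pinned via cudaMallocHost/cudaHostAlloc, since async copies from pageable memory do not overlap with kernel execution):

    // Sketch: split the work across K streams so H2D copies, kernel
    // execution, and D2H copies from different chunks can overlap.
    const int K = 4;                           // number of streams (assumption)
    const size_t chunkBytes = totalBytes / K;  // per-stream chunk (assumes K divides evenly)

    cudaStream_t stream[K];
    for (int i = 0; i < K; ++i)
        cudaStreamCreate(&stream[i]);

    dim3 grid(N / K);                          // each launch covers N/K blocks
    for (int i = 0; i < K; ++i) {
        size_t off = i * (chunkBytes / sizeof(float));
        cudaMemcpyAsync(d_in + off, h_in + off, chunkBytes,
                        cudaMemcpyHostToDevice, stream[i]);
        kernel<<< grid, threads, 0, stream[i] >>>(d_in + off, d_out + off);
        cudaMemcpyAsync(h_out + off, d_out + off, chunkBytes,
                        cudaMemcpyDeviceToHost, stream[i]);
    }
    cudaThreadSynchronize();                   // wait for all streams to finish

    for (int i = 0; i < K; ++i)
        cudaStreamDestroy(stream[i]);

Whether you issue copy/kernel/copy per stream in one loop (as here) or in three separate loops (as in the pseudocode) can matter: on devices with a single copy engine, the breadth-first order can avoid serializing the overlap.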
After playing around with some of the new features in CUDA 2.2 (such as cudaHostAlloc with the cudaHostAllocMapped flag), I’ve found that using zero-copy mapped memory gives better overall performance than manually programming the asynchronous copies or doing everything in one sweep.
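For reference, the zero-copy setup looks roughly like this (a sketch; kernel, grid, threads, and totalBytes are placeholders). The kernel reads and writes the mapped host memory directly over PCIe, so no explicit memcpy calls are needed at all:

    float *h_data, *d_data;

    // Enable mapping of pinned host allocations into the device address
    // space; in CUDA 2.2 this must be called before any other CUDA call.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Page-locked host allocation, mapped into the device address space
    cudaHostAlloc((void**)&h_data, totalBytes, cudaHostAllocMapped);

    // Obtain the device-side pointer aliasing the same memory
    cudaHostGetDevicePointer((void**)&d_data, h_data, 0);

    // Kernel accesses host memory directly; copies happen implicitly
    kernel<<< grid, threads >>>(d_data);
    cudaThreadSynchronize();   // ensure the kernel's writes are visible to the host

    cudaFreeHost(h_data);

Note that zero-copy tends to win when each byte is touched only once or twice; data that is reused many times on the device is still better off copied into global memory first.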