I’ve never tried using streams to overlap memory operations with kernel execution…
I wonder: if I have N blocks (where N is fairly large, around 30-40 thousand) and each block
uses M bytes of global memory (where M is, say, 1-2 KB),
would I get any performance benefit from splitting my kernel launch across K streams?
I.e., something like:

    for i = 0 to K-1
        memcpyAsync(stream[i], HostToDev)

    grid = dim3(N / K)
    for i = 0 to K-1    // launch K kernels asynchronously
        kernel<<< grid, threads, 0, stream[i] >>>()

    for i = 0 to K-1
        memcpyAsync(stream[i], DevToHost)
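
To make the idea concrete, here is a rough sketch of what I have in mind. It is only an illustration under several assumptions of mine: the host buffer is pinned with cudaMallocHost (as far as I know, async copies need pinned memory to actually overlap), K divides N evenly, and the kernel is a placeholder that just increments every byte of its block's M-byte slice. The function name runChunked and the kernel body are made up for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel (hypothetical): each block increments its own M-byte slice.
__global__ void kernel(unsigned char *data, int bytesPerBlock)
{
    unsigned char *slice = data + (size_t)blockIdx.x * bytesPerBlock;
    for (int i = threadIdx.x; i < bytesPerBlock; i += blockDim.x)
        slice[i] += 1;
}

// Splits N blocks of M bytes each across K streams.
// Returns the number of wrong bytes (0 on success) or -1 on a CUDA error.
// Assumption: K divides N evenly.
int runChunked(int N, int M, int K, int threads)
{
    size_t total = (size_t)N * M;
    int blocksPerChunk = N / K;
    size_t chunkBytes = (size_t)blocksPerChunk * M;

    unsigned char *h = NULL, *d = NULL;
    if (cudaMallocHost((void **)&h, total) != cudaSuccess) return -1;  // pinned host memory
    if (cudaMalloc((void **)&d, total) != cudaSuccess) { cudaFreeHost(h); return -1; }
    for (size_t i = 0; i < total; ++i) h[i] = (unsigned char)(i % 251);

    cudaStream_t *stream = new cudaStream_t[K];
    for (int i = 0; i < K; ++i) cudaStreamCreate(&stream[i]);

    // Same three-loop issue order as in the pseudocode above.
    for (int i = 0; i < K; ++i)
        cudaMemcpyAsync(d + i * chunkBytes, h + i * chunkBytes, chunkBytes,
                        cudaMemcpyHostToDevice, stream[i]);

    for (int i = 0; i < K; ++i)  // launch K kernels asynchronously
        kernel<<<blocksPerChunk, threads, 0, stream[i]>>>(d + i * chunkBytes, M);

    for (int i = 0; i < K; ++i)
        cudaMemcpyAsync(h + i * chunkBytes, d + i * chunkBytes, chunkBytes,
                        cudaMemcpyDeviceToHost, stream[i]);

    cudaError_t err = cudaDeviceSynchronize();  // wait for all streams to finish

    // Every byte should have been incremented exactly once.
    int bad = 0;
    if (err == cudaSuccess)
        for (size_t i = 0; i < total; ++i)
            if (h[i] != (unsigned char)((i % 251) + 1)) ++bad;

    for (int i = 0; i < K; ++i) cudaStreamDestroy(stream[i]);
    delete[] stream;
    cudaFree(d);
    cudaFreeHost(h);
    return (err == cudaSuccess) ? bad : -1;
}

int main()
{
    // N = 32768 blocks, M = 1024 bytes per block, K = 4 streams, 128 threads per block.
    int bad = runChunked(32768, 1024, 4, 128);
    printf("mismatched bytes: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Timing this with K = 1 against a few larger values of K (for example with cudaEvent timers) would be one way to see whether the split pays off. Whether it does presumably depends on whether the device can overlap copies with kernel execution at all.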