Hi,
I’m wondering: if I launch N CUDA streams (literally N kernels launched in a streaming fashion) to overlap computation with host–device memory copies, do the different streams share resources, such as the L2 cache?
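For context, here is a minimal sketch of the launch pattern I have in mind (the kernel, buffer names, and sizes are just illustrative): each stream does an H2D copy, then a kernel, then a D2H copy, so copies in one stream can overlap with compute in another.

```cuda
#include <cuda_runtime.h>

// Placeholder computation; stands in for the real per-chunk kernel.
__global__ void process(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int N_STREAMS = 4;
    const int CHUNK = 1 << 20;  // elements per stream
    float *h_buf, *d_buf;
    // Pinned host memory is required for cudaMemcpyAsync to actually overlap.
    cudaMallocHost(&h_buf, N_STREAMS * CHUNK * sizeof(float));
    cudaMalloc(&d_buf, N_STREAMS * CHUNK * sizeof(float));

    cudaStream_t streams[N_STREAMS];
    for (int s = 0; s < N_STREAMS; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < N_STREAMS; ++s) {
        float *h = h_buf + s * CHUNK;
        float *d = d_buf + s * CHUNK;
        // Copy in, compute, copy out -- all queued on this stream only.
        cudaMemcpyAsync(d, h, CHUNK * sizeof(float), cudaMemcpyHostToDevice, streams[s]);
        process<<<(CHUNK + 255) / 256, 256, 0, streams[s]>>>(d, CHUNK);
        cudaMemcpyAsync(h, d, CHUNK * sizeof(float), cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < N_STREAMS; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}
```

My question is about what happens inside the device while these streams are in flight.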
As far as I know, only one kernel can be executing its computation at any given time. But while Stream 1's kernel is computing, can Stream 2's memory transfer go through the L2 cache that Stream 1 already holds?
Thanks