I’ve read in an NVIDIA PDF that you should launch many threads to hide global memory (gmem) latency. However, I am confused about whether the gmem transactions of newly scheduled warps will block on the gmem transactions of previously switched-out warps.
Suppose a threadblock has 16 warps, and each warp attempts to read from a different 32-byte line from global memory.
If a gmem transaction takes ~400 cycles, should this take at least 16x400 cycles to load the data for all warps, where each warp must wait (in serial) for the other warps' gmem transactions to finish?
Or should this take roughly 400 cycles + 16x(warp-switching latency), where the memory system can have all 16 of the 32-byte line reads in flight concurrently, so their latencies overlap?
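For concreteness, the access pattern I have in mind looks something like this sketch (kernel and buffer names are made up for illustration; assumes `blockDim.x == 512`, i.e. 16 warps per block):

```cuda
// Sketch: each of the 16 warps in the block reads its own 32-byte
// line from global memory. A 32-byte line holds 8 ints, so lanes
// 0..7 of each warp cooperate on the load; the other lanes are idle.
__global__ void warp_per_line(const int* __restrict__ lines, int* out)
{
    int warp = threadIdx.x / 32;  // warp index 0..15 within the block
    int lane = threadIdx.x % 32;  // lane index within the warp

    int v = 0;
    if (lane < 8)
        v = lines[warp * 8 + lane];  // 8 ints * 4 B = one 32-byte line

    out[threadIdx.x] = v;
}
```

The question is whether these 16 per-warp loads are serviced one after another (serialized) or issued back-to-back so that their ~400-cycle latencies overlap.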