Are gmem transactions from different warps in an SM serialized or run in parallel?

I’ve read in an NVIDIA PDF that you should launch many threads to hide gmem latency. However, I’m confused about whether gmem transactions from newly scheduled warps will block on gmem transactions from previously switched-out warps.

Suppose a threadblock has 16 warps, and each warp attempts to read from a different 32-byte line from global memory.

If a gmem transaction takes 400 cycles, should it take at least 16x400 cycles to load the data for all warps from gmem, with each warp waiting (serially) for the previous warp’s gmem transaction to finish?

Or should it take 400 cycles + 16x(switching latency), because the memory system can retrieve each warp’s 32-byte line from gmem in parallel?

Thanks.

It’s closer to the second description. Memory access can be thought of as a pipeline, like many other GPU operations. The “depth” of the pipeline is the number of cycles of latency to expect, and on each clock cycle the memory controllers (the “pipeline”) can accept new memory requests, which are placed into the pipeline.

Suppose warp 0 requests bytes 0-128 (considered warp-wide) and warp 1 requests bytes 1024-1152 (considered warp-wide), with the warp 0 request entering the pipeline at clock cycle 0 and the warp 1 request entering at clock cycle 1. Let’s suppose the pipeline “depth” is 400 cycles. At clock cycle 400, I would expect bytes 0-128 to “appear” (i.e. be deposited into GPU registers), and at clock cycle 401 I would expect bytes 1024-1152 to “appear”. That is the nature of how a pipeline works. This is of course glossing over many details, but it should help resolve the question you posed.
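To make the arithmetic concrete, here is a toy Python sketch of the pipeline model described above. The 400-cycle depth and the one-request-per-cycle issue rate are illustrative assumptions for the sake of the example, not hardware specifications:

```python
# Toy model of a pipelined memory subsystem: one new warp-wide
# request can enter the pipeline each cycle, and each request
# completes PIPELINE_DEPTH cycles after it enters.
PIPELINE_DEPTH = 400   # assumed latency in cycles (illustrative)
NUM_WARPS = 16         # one 32-byte line requested per warp

def pipelined_completion_cycles(num_warps, depth):
    # Warp i issues its request at cycle i; its data "appears"
    # (is deposited into registers) at cycle i + depth.
    return [i + depth for i in range(num_warps)]

def serialized_completion_cycles(num_warps, depth):
    # Hypothetical fully serialized case: each request must wait
    # for the previous request to complete before it can start.
    return [(i + 1) * depth for i in range(num_warps)]

pipelined = pipelined_completion_cycles(NUM_WARPS, PIPELINE_DEPTH)
serialized = serialized_completion_cycles(NUM_WARPS, PIPELINE_DEPTH)

print(pipelined[-1])    # 415: last warp's data arrives at cycle 415
print(serialized[-1])   # 6400: 16 x 400 if requests were serialized
```

Under this model, all 16 warps have their data after roughly 400 + 16 cycles rather than 16x400 cycles, which is why launching many warps hides memory latency.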
