First of all, from the programmer’s perspective, the GPU never operates on DRAM directly. It works out of the L2 cache.
When the GPU wants to read data from DRAM that is not already in the L2 cache, the first thing that happens is that memory “segments” (those 32-byte items you read about) are read from DRAM into L2 cachelines. The actual requests for that data are then serviced out of the L2 cacheline(s).
If subsequent changes are made to these areas, those changes affect the L2 cache first. Later, if an L2 cacheline is “evicted”, the entire cacheline gets written back to DRAM. If you think about this process carefully, you will agree that no “masking” is needed: there is a 1:1 correspondence between any valid cacheline and its corresponding memory segment.
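To make the segment granularity concrete, here is a minimal sketch of my own (kernel names, sizes, and the stride are purely illustrative, not taken from any documentation). Under a profiler such as Nsight Compute you would expect the coalesced kernel to move about four 32-byte sectors per warp-level load, while the strided kernel touches a separate sector for nearly every thread in the warp, even though the amount of useful data per thread is the same:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Each warp of coalescedRead touches 32 consecutive floats = 128 contiguous
// bytes, i.e. four 32-byte segments. stridedRead spreads the warp's 32 loads
// across many different segments, so far more sectors move between DRAM and L2.
__global__ void coalescedRead(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                    // adjacent threads, adjacent addresses
}

__global__ void stridedRead(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];  // adjacent threads, scattered addresses
}

int main()
{
    const int n = 1 << 20, stride = 8;            // stride of 8 floats = one 32-byte segment per thread
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float) * stride);
    cudaMalloc(&out, n * sizeof(float));
    coalescedRead<<<(n + 255) / 256, 256>>>(in, out, n);
    stridedRead  <<<(n + 255) / 256, 256>>>(in, out, n * stride, stride);
    cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```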
So the other case to consider is what happens if the GPU writes to a memory location before ever having read from it. AFAIK this behavior is not published, but the possibilities can be inferred, knowing that the L2 cache only interacts with DRAM by reading or writing a cacheline/memory segment. I can think of two:
- The L2 cache acts as a temporary buffer. The bytes in the cacheline that have been written are marked as valid (the untouched bytes as invalid). If the cacheline gets evicted, the segment is first read from DRAM, then merged with the valid bytes held in the L2, and finally written back in a single write transaction.
(or)
- The L2 cache, on receiving a global memory write to an invalid line, first reads that segment from DRAM and then updates the written value(s) in the L2 cache. If an eviction occurs later, the full cacheline is written out.
Again, in either case above, the fundamental quantum of a DRAM transaction is a 32-byte memory segment, or multiple segments. Never anything else, when transacting with DRAM. That is how modern DRAM subsystems work.
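As a concrete illustration of the write-before-read case (my own hypothetical example, not anything published), the kernel below stores a single 4-byte float into each 32-byte segment of a freshly allocated buffer that has never been read. Whichever of the two mechanisms above the hardware actually uses, the DRAM traffic still happens in whole 32-byte segments; the only question is whether the segment read happens when the line is first written or when it is evicted:

```cpp
// Sketch (my own example): write-before-read, one float per 32-byte segment.
// Each thread dirties only 4 of the 32 bytes in "its" segment; per the
// discussion above, whole segments still move between L2 and DRAM.
__global__ void sparseFill(float *out, int nSegments, float v)
{
    int seg = blockIdx.x * blockDim.x + threadIdx.x;
    if (seg < nSegments)
        out[seg * 8] = v;   // 8 floats = 32 bytes; touch only the first float of each segment
}
```

Launched over a buffer straight out of cudaMalloc, no prior read of that memory has occurred in the kernel, which is exactly the situation described above.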
L2 cache latency (assuming a hit in L2) should be shorter than DRAM latency. This is true for reads and for writes. As njuffa said, writes (and, in my view, reads) are “fire and forget”. The only time DRAM latency is ordinarily discoverable is when a read is followed by a dependent operation: the dependent operation may stall due to, effectively, DRAM latency. A read operation, by itself, never causes a stall. In that respect, I refer to the read operation itself as “fire and forget” also. (njuffa referred to the write operation as “fire and forget”. I agree, and I would extend the term to read operations as well, in light of the discussion above, drawing the distinction between the read operation itself and any subsequent dependent operations. That is just my view of the terminology; I’m sure others have other views.)
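To show what “fire and forget” means for a read, here is a small sketch of my own (the kernel and variable names are made up for illustration). The load is issued on one line, independent arithmetic follows, and any stall attributable to DRAM/L2 latency shows up only at the first instruction that actually consumes the loaded value:

```cpp
// Sketch, my own illustrative example: the global load below is issued and the
// warp keeps going; "indep" does not depend on the loaded value, so it can
// execute while the read is in flight. If the warp stalls at all, it stalls at
// the line that finally consumes "x", not at the load itself.
__global__ void latencyHiding(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = in[i];                 // read issued here ("fire and forget")
    float indep = sinf((float)i);    // independent work overlaps the read latency
    out[i] = x + indep;              // first dependent use: this is where a stall can appear
}
```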