How do GPUs "handle" writes?

I’m a bit confused about how CUDA handles writes. Much of the information that I’ve come across about global memory access is about reads. I know that global memory reads are coalesced in 128-byte cachelines. My question though is about writes.

Are writes also coalesced? If writes are coalesced, would a coalesced write complete sooner if only a small proportion of the bytes in the cacheline are written to global memory?

For example, assuming that only 4 of the 32 threads in a warp execute an if-statement that writes an integer to global memory. Only 4x4 = 16 bytes of the 128 bytes of the cacheline would be written to global memory. Would this write complete sooner than the case where all 32 threads in the warp executed the if-statement, and 128 bytes (the entire cacheline) were written to global memory?
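In code, the scenario I have in mind is something like this (a minimal sketch; the kernel name and the 4-thread condition are just for illustration):

```cuda
// Sketch: only threads 0..3 of each warp take the branch, so only
// 4 x 4 = 16 of the warp's 128 bytes get written.
__global__ void divergent_write(int *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if ((tid % 32) < 4) {   // 4 of the 32 threads in each warp
        out[tid] = tid;
    }
}
```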

OTOH, if writes are not coalesced, would it be correct to assume that the latency will depend on the number of write requests issued?

Yes, writes are coalesced.

Coalescing occurs before the transaction(s) is(are) given to the load/store unit to carry out. Therefore your question about what would happen sooner doesn’t make a lot of sense to me. By the time the load/store unit gets it, it is a transaction of a particular size, regardless of which bytes within that particular size were actually written to.

There is no such concept as a small proportion of the bytes in a cacheline being written to global memory. DRAM transactions are of a fixed size, called a segment. You don’t get to write individual bytes (just like you don’t get to read individual bytes).

“For example, assuming that only 4 of the 32 threads in a warp execute an if-statement that writes an integer to global memory”

OK.

" Only 4x4 = 16 bytes of the 128 bytes of the cacheline would be written to global memory. "

Wrong.

To first order, writes to global memory are “fire & forget”, i.e. there is almost never a need to be concerned about their latency.

Thanks. I thought only atomics had fire and forget behavior.

Thanks txbob. This answers a lot of questions but raises a few more.

From what I’ve learned (from watching nearly every talk on the subject; maybe I need to watch them again) and my initial read of the Wikipedia page on GDDR, the minimal coalesced size is 32 bytes. That being the case, what happens when threads write shorts or bools to global memory? Is there some sort of masking operation that protects unmodified parts of the segment, or does the GPU perform writes of such (smaller) data types in a less efficient way that leads to a performance penalty?
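For concreteness, the kind of store I have in mind is something like this (hypothetical kernel):

```cuda
// Each thread stores a 2-byte short, so a full warp writes
// 32 x 2 = 64 contiguous bytes -- two 32-byte segments.
__global__ void short_write(short *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = (short)tid;
}
```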

First of all, from the programmer’s perspective, the GPU never does anything directly to DRAM. It works out of the L2 cache.

When the GPU wants to read data from DRAM that is not already in the L2 cache, the first thing that happens is that memory “segments” (those 32-byte length items you read about) are read from DRAM into L2 cachelines. Actual requests for that data are then serviced out of the L2 cache line(s).

If subsequent changes to these areas are made, those changes affect the L2 cache first. Later, if an L2 cache line is “evicted”, the entire cacheline gets written. If you think about this process carefully, you will agree that there is no “masking” needed. There is a 1:1 correspondence between any valid cache line and a corresponding memory segment.

So the other case to consider is what happens if the GPU writes to a memory location before having ever read from it. AFAIK this behavior is not published, but the possibilities can be inferred, knowing that the L2 cache only interacts with DRAM by reading or writing a cache line/memory segment.

  1. The L2 cache acts as a temporary buffer. Written data in the cacheline is marked as valid (and unwritten bytes as invalid). If the cacheline gets evicted, a read from DRAM is performed first, followed by a merge (the valid bytes in the L2 buffer overwrite the corresponding bytes just read), followed by a write transaction.

(or)

  2. The L2 cache, on receiving a global memory write to an invalid line, performs a read of that segment first, and then updates the values in the L2 cache. If an eviction occurs later, the full cacheline is written out.

Again, in either case above, the fundamental quantum of a transaction is a 32-byte memory segment, or multiple segments. Never anything else, when transacting with DRAM. That is how modern DRAM subsystems work.

L2 cache latency (assuming a hit in L2) should be shorter than DRAM latency. This is true for reads or writes.

As njuffa said, writes are “fire and forget”, and in my view, reads are too. The only time the latency from DRAM is ordinarily discoverable is if you have a read followed by a dependent operation. The dependent operation may stall due to, effectively, DRAM latency. A read operation, by itself, never causes a stall. In that respect, I refer to the read operation itself as “fire and forget” also, making the distinction between the read operation itself and any subsequent dependent operations. That is just my view of the terminology; I’m sure others have other views.
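A sketch of where that dependent-operation stall shows up (illustrative kernel; the names are made up):

```cuda
__global__ void latency_demo(const float *in, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float v = in[tid];    // the load itself issues and the warp moves on
    // ...independent instructions could still issue here...
    out[tid] = v * 2.0f;  // first use of v: the warp may stall here,
                          // effectively exposing DRAM (or L2) latency
}
```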

AFAIR, L2 cache latencies were around 100 cycles on Maxwell. Even L1/shared memory latencies were 30 cycles at best.

Read latencies are easily discoverable, since you need more parallel warps to cover them, or more registers holding data that will be used in the future.

I’ve modified my answer. Here is a summary of some measured latencies:

http://lpgpu.org/wp/wp-content/uploads/2013/05/poster_andresch_acaces2014.pdf

Thanks, txbob, it’s a great summary.

I concur. I haven’t replied yet because I’m still “meditating” on the answer. It’s that good.

Does the “fire and forget” apply to non-coalesced writes?
I am thinking of the case where block i writes one word to element i of an output-only array.
Unless the scheduler puts these blocks together (and there is no reason it should?), global memory will get a whole load of random updates.
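Roughly the pattern I mean (hypothetical sketch):

```cuda
// Each block writes a single word to element blockIdx.x of an
// output-only array; each block issues just one small store.
__global__ void per_block_write(int *out, int value)
{
    if (threadIdx.x == 0) {
        out[blockIdx.x] = value;
    }
}
```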

Does this matter?

Best wishes
Bill

My 10,000ft view of CUDA performance tuning is: (1) Worry about getting data to the functional units (2) Worry about using the functional units efficiently. In that order.

But such rules of thumb (likewise: don’t worry about writes, don’t worry about local branching) are just that. For specific cases, one should always consult the CUDA profiler.

Coalescing only applies to the accesses generated by a single load or store instruction, with respect to a single warp.

There is no concept of coalescing between warps, or blocks, or between separate instructions issued to the same warp.
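A minimal sketch of that per-instruction, per-warp distinction (illustrative only; assume the output buffers are allocated large enough for the strided store):

```cuda
__global__ void store_patterns(float *coalesced_out, float *strided_out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Adjacent threads hit adjacent 4-byte words: the warp's 32 stores
    // coalesce into 128 contiguous bytes (four 32-byte segments).
    coalesced_out[tid] = 1.0f;
    // Adjacent threads are 32 words (128 bytes) apart: every store lands
    // in a different 32-byte segment, so this one instruction needs
    // 32 segments for the warp.
    strided_out[tid * 32] = 2.0f;
}
```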