I was wondering how many cycles of latency a write to global memory costs. I apologize if my question is answered in the Programming Guide: I checked carefully for an answer, but I might have missed it.
I ask because I heard of a guy who changed an algorithm so that the new version had only slightly better occupancy but 97% fewer writes to global memory (same # of reads) than the original, and the new version proved to be two full orders of magnitude faster than the original.
Absolutely – coalescing reduces the number of memory transactions required, which is always better, for loads and stores alike.
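To make the coalescing point concrete, here is a minimal sketch that writes the same buffer twice, once with neighbouring threads storing to neighbouring addresses and once with neighbouring threads storing 33 floats apart. The kernel names, the helper `run_store`, and the stride of 33 are made up for illustration, `n` is assumed to be a power of two, and error checking is left out. How large the timing gap is will depend on the GPU, so no numbers are claimed here.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Coalesced: thread i writes element i, so the stores of one warp
// land on consecutive addresses.
__global__ void store_coalesced(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 1.0f;
}

// Scattered: thread i writes element (i * stride) % n, so neighbouring
// threads write addresses `stride` floats apart. With n a power of two
// and an odd stride, every element is still written exactly once.
__global__ void store_scattered(float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = (int)(((long long)i * stride) % n);
        out[j] = 1.0f;
    }
}

// Hypothetical helper: runs one variant, stores the elapsed time in *ms,
// and returns how many elements ended up equal to 1.0f.
int run_store(bool coalesced, int n, float *ms) {
    float *d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_out, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start);
    if (coalesced) store_coalesced<<<blocks, threads>>>(d_out, n);
    else           store_scattered<<<blocks, threads>>>(d_out, n, 33);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(ms, start, stop);

    std::vector<float> h(n);
    cudaMemcpy(h.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    int written = 0;
    for (float v : h) written += (v == 1.0f);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return written;
}
```

Calling `run_store(true, 1 << 20, &ms)` and then `run_store(false, 1 << 20, &ms)` and comparing the two `ms` values gives a rough feel for the cost of uncoalesced stores on a given card. The first launch also pays one-time startup overhead, so it is worth running each variant more than once.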
In general, the absolute latency of stores should be about the same as that of loads (hundreds of cycles per transaction), but because stores are “fire and forget” (the thread does not wait for any result to come back), their latency can be hidden more readily than that of loads.
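A small sketch of what “fire and forget” means at the source level, under the assumption that the three buffers do not overlap. The kernel name, the `side` buffer, and the helper `run_store_then_load` are hypothetical. The kernel only shows where a dependency exists and where it does not; it does not measure anything.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void store_then_dependent_load(const float *in, float *side,
                                          float *result, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    side[i] = 2.0f * i;        // store: nothing later in this thread uses its outcome
    float x = in[i];           // load: issued right after the store
    result[i] = x * x + 1.0f;  // needs x, so this has to wait for the load to return
}

// Hypothetical helper: fills `in` with 3.0f, runs the kernel,
// and returns result[0] (3 * 3 + 1).
float run_store_then_load(int n) {
    std::vector<float> h_in(n, 3.0f);
    float *d_in = nullptr, *d_side = nullptr, *d_result = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_side, n * sizeof(float));
    cudaMalloc(&d_result, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    store_then_dependent_load<<<blocks, threads>>>(d_in, d_side, d_result, n);

    float first = 0.0f;
    // This copy runs on the default stream, so it waits for the kernel to finish.
    cudaMemcpy(&first, d_result, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_side);
    cudaFree(d_result);
    return first;
}
```

The store to `side` has no consumer inside the thread, while the load into `x` is needed by the very next statement. That difference is why load latency tends to be the harder one to hide.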
Ok, thanks Sarnath and Mark! I would have guessed that it just needed to place the write info on the RAM bus… now at least I have the certainty that writes are fire and forget.
Global memory writes are “fire and forget”, but not totally without latency. AFAIK there’s a 24-cycle latency due to the read-after-write dependency on a register: if you write the contents of a register to global memory, any subsequent read of that register has to wait until about 24 GPU cycles have passed.
So limit your writes to the end of processing, or avoid any further reads from the same register right after the write, to avoid its latency.
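As a rough illustration of keeping writes for the end, here is a sketch with two variants of a per-thread sum: one stores the running total to global memory on every iteration, the other keeps it in a register and stores once. Both perform the same reads. The kernel names, the helper `run_sum`, and the column-wise data layout are assumptions made for the example, and error checking is omitted. The sketch does not try to reproduce the 24-cycle figure; it only shows the restructuring.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Variant A: m global stores per thread (one per loop iteration).
__global__ void sum_write_every_step(const float *in, float *out, int n, int m) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < m; ++k) {
        acc += in[(size_t)k * n + i];  // neighbouring threads read neighbouring floats
        out[i] = acc;                  // running total goes to global memory each time
    }
}

// Variant B: same reads, but a single global store per thread at the end.
__global__ void sum_write_once(const float *in, float *out, int n, int m) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < m; ++k)
        acc += in[(size_t)k * n + i];  // accumulate in a register
    out[i] = acc;                      // one store when the work is done
}

// Hypothetical helper: fills the input with 1.0f, runs one variant,
// and returns out[0], which should equal m.
float run_sum(bool write_once, int n, int m) {
    std::vector<float> h_in((size_t)n * m, 1.0f);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, h_in.size() * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    if (write_once) sum_write_once<<<blocks, threads>>>(d_in, d_out, n, m);
    else            sum_write_every_step<<<blocks, threads>>>(d_in, d_out, n, m);

    float first = 0.0f;
    // This copy runs on the default stream, so it waits for the kernel to finish.
    cudaMemcpy(&first, d_out, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return first;
}
```

Both variants return the same result, so the second one can be dropped in wherever intermediate values in `out` are not needed by anyone else while the kernel runs.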