Latency for writes to global memory

Hey all,

I was wondering how many latency cycles a write to global memory costs. I apologize if my question is answered in the Programming Guide: I tried to check carefully for an answer, but I might have missed it.

I ask because I heard of a guy who changed an algorithm so that the new version had only slightly better occupancy but 97% fewer writes (same number of reads) to global memory than the original, and the new version proved to be two full orders of magnitude faster than the original.

Thanks,

Claudio

Stores are usually less costly than loads, because loads require the data to arrive before dependent instructions can proceed…

Stores just need to dispatch the data to the store unit, and the thread can happily move on to the next instruction.

This is true of many architectures…

In CUDA, I learnt that stores work in “fire and forget” mode (search for that phrase and you will find the thread that discusses it).

That said, coalesced stores are faster, from what I have heard, so it is better to coalesce stores as well as loads.

All this info is from memory - I could be wrong.

Absolutely – coalescing reduces the number of transactions required, which is always better, whether loads or stores.

In general, absolute latency of stores should be about the same as loads (hundreds of cycles per transaction), but because stores are “fire and forget”, their latency can be more readily hidden than loads.

Mark

Thanks for endorsing Mark.

At least, I now know that I am not running around with some misconceptions :)

Ok, thanks Sarnath and Mark! I would have guessed that it just needed to place the write info on the RAM bus… now at least I have the certainty that writes are fire and forget.

:thumbup:

Global memory writes are “fire and forget” but not totally without latency. AFAIK there is a roughly 24-cycle read-after-write latency on a register: if you write the content of a register to global memory, any subsequent read of that register will have to wait (or be scheduled) about 24 GPU cycles later.

So limit your writes to the end of processing, or avoid any further reads from the same register, to avoid its latency.
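As an illustration of that pattern, here is a hypothetical kernel sketch (the names and values are made up for the example, not from the thread): reusing a register immediately after storing it stalls on the read-after-write dependency, while doing independent work first lets the scheduler hide it.

```cuda
__global__ void raw_example(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float r = in[i] * 2.0f;
    out[i] = r;               // store: fire and forget
    // Reading r on the very next instruction would pay the
    // read-after-write register latency (~24 cycles, per the post above).
    // Independent work in between hides it:
    float s = in[i] + 1.0f;   // does not depend on r
    out[i + n] = s;           // assumes out has room for 2*n floats
    // ...reuse r only after the independent work, if needed.
}
```

This is a sketch of the scheduling idea, not a tuned kernel; in practice the compiler and the warp scheduler already interleave independent instructions for you when occupancy allows.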