A question about the costs of writing to global memory:
Does writing to global memory cost 400 to 600 clock cycles of latency (like reading), or just the 4 clock cycles needed to issue the instruction? If the latter, does this remain true if a kernel makes a series of writes in quick succession? And what if the memory that was written then has to be read back later?
I don’t recall if it is spelled out in the guide, but it has been said on the forum that writes are “fire and forget”. When a thread gets to a write, it just passes it off to the memory writing unit and immediately continues on calculating.
Of course, that memory unit can only have a queue so deep, so if you saturate it, threads are bound to start having to wait. Still, it is not difficult to write a kernel that mostly maxes out the memory bandwidth, ~70 GB/s.
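
If you want to check this on your own card, something like the following write-only test should do. It is only a rough sketch: the names `fill_kernel` and `write_bandwidth_gbs`, the block size of 256, and the buffer size are my own choices for illustration, and the number you get will depend on your hardware. Each thread issues a single coalesced store and never reads global memory, so the timing reflects store throughput only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Write-only kernel: each thread stores one float and reads nothing from
// global memory, so only store throughput is being exercised.
__global__ void fill_kernel(float *out, size_t n, float value)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = value;  // consecutive threads write consecutive addresses
    }
}

// Hypothetical helper: returns the measured store bandwidth in GB/s,
// or a negative value if something went wrong.
float write_bandwidth_gbs(size_t n, int repeats)
{
    float *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) return -1.0f;

    const int threads = 256;  // assumed block size, not tuned
    const int blocks = (int)((n + threads - 1) / threads);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time startup cost is not included in the timing.
    fill_kernel<<<blocks, threads>>>(d_out, n, 0.0f);

    cudaEventRecord(start);
    for (int r = 0; r < repeats; ++r) {
        fill_kernel<<<blocks, threads>>>(d_out, n, 1.0f);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all timed launches have finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaError_t err = cudaGetLastError();

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);

    if (err != cudaSuccess || ms <= 0.0f) return -1.0f;

    // Total bytes stored divided by elapsed seconds, reported in GB/s.
    double bytes = (double)n * sizeof(float) * repeats;
    return (float)(bytes / (ms * 1.0e-3) / 1.0e9);
}

int main()
{
    // 4M floats (16 MB) per launch, 20 timed launches.
    float gbs = write_bandwidth_gbs((size_t)1 << 22, 20);
    printf("store bandwidth: %.1f GB/s\n", gbs);
    return gbs > 0.0f ? 0 : 1;
}
```

Compile with `nvcc` and compare the printed figure against the theoretical bandwidth of your card. If the stores really are fire and forget until the queue fills, the result should land close to the memory bandwidth limit rather than being bounded by per-write latency.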