Latency in writing to Global Memory

Hi all!

A question about the costs of writing to global memory:

Does writing to global memory cost 400 to 600 clock cycles of latency (like reading), or just the 4 clock cycles needed to issue the command? If the latter, does this remain true if a kernel makes a series of writes in quick succession? And what if the memory that was written then needs to be read back?

I think that, apart from the clock cycles, you are not able to read from global memory after you have written to the same location within a kernel.

See this topic.

I guess that if reading takes 400 to 600 clock cycles, then a write might take the same number of cycles. But I am not sure.
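
On the read-after-write point, here is a minimal sketch of the case I believe is safe: threads in the same block write to global memory, hit a __syncthreads() barrier, and then read what a neighbouring thread wrote. As far as I understand, the barrier makes earlier global writes visible to the other threads of that block, but it says nothing about threads in other blocks. The names writeThenRead, buf, and out are made up for illustration, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread writes one element to global memory, then reads the element
// written by its neighbour in the same block.
__global__ void writeThenRead(int *buf, int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[i] = i;  // write to global memory

    // Block-wide barrier: the writes above become visible to every thread
    // of this block. It does NOT synchronise with other blocks.
    __syncthreads();

    int neighbour = blockIdx.x * blockDim.x + (threadIdx.x + 1) % blockDim.x;
    out[i] = buf[neighbour];  // read a value that another thread wrote
}

int main()
{
    const int threads = 128, blocks = 4, n = threads * blocks;
    int *buf = nullptr, *out = nullptr;
    cudaMalloc(&buf, n * sizeof(int));
    cudaMalloc(&out, n * sizeof(int));

    writeThenRead<<<blocks, threads>>>(buf, out);

    int *host = new int[n];
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(host, out, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Check that every thread saw its neighbour's write.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        int expected = (i / threads) * threads + (i % threads + 1) % threads;
        if (host[i] != expected) ++mismatches;
    }
    printf("mismatches: %d\n", mismatches);

    delete[] host;
    cudaFree(buf);
    cudaFree(out);
    return mismatches != 0;
}
```

If the reader and the writer are in different blocks, this sketch does not apply; the usual advice is to split the work into two kernel launches.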


I don’t recall if it is spelled out in the guide, but it has been said on the forum that writes are “fire and forget”. When a thread gets to a write, it just passes it off to the memory writing unit and immediately continues on calculating.

Of course, that memory unit can only have a queue so deep, so if you saturate it, threads are bound to start having to wait. Still, it is not difficult to write a kernel that mostly maxes out the memory bandwidth, around 70 GB/s.
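
If you want to check the write bandwidth on your own card, something along these lines should do. It is only a rough sketch: a write-only kernel timed with CUDA events. The kernel name fillKernel, the 256 MiB buffer size, and the 1024 x 256 launch configuration are arbitrary choices of mine, not anything from the guide, and the number you get will depend on the hardware.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Write-only kernel: each thread stores to consecutive addresses
// (coalesced) and strides over the whole buffer. Nothing is read back.
__global__ void fillKernel(float *dst, size_t n, float value)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride)
        dst[i] = value;
}

int main()
{
    const size_t n = (size_t)64 << 20;  // 64 Mi floats = 256 MiB (assumed size)
    float *dst = nullptr;
    if (cudaMalloc(&dst, n * sizeof(float)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    fillKernel<<<1024, 256>>>(dst, n, 1.0f);  // warm-up launch, not timed

    cudaEventRecord(start);
    fillKernel<<<1024, 256>>>(dst, n, 1.0f);  // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);               // wait until the kernel is done

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds

    double gbPerSec = (double)(n * sizeof(float)) / (ms * 1e-3) / 1e9;
    printf("wrote %zu bytes in %.3f ms: %.1f GB/s\n",
           n * sizeof(float), ms, gbPerSec);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dst);
    return 0;
}
```

Because the timing covers the whole kernel, this measures sustained write throughput rather than the latency of a single store.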