global memory read after write

stencil · March 24, 2009, 4:27am

I know the following about global memory usage:
Between threads of different blocks, there is no way to guarantee a global memory write from thread A is correctly read by thread B.
Between threads of the same block, we can guarantee thread A’s write is correctly read by thread B if we insert a __syncthreads() between the write and the read. (p. 28 of the CUDA 2.1 Programming Guide)
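To make the same-block case concrete, here is a minimal sketch. The kernel and helper names (`neighbour_read`, `count_neighbour_mismatches`) are made up for illustration, and it assumes a single block with at most as many threads as the device allows per block. Each thread writes its own slot in global memory, waits at the barrier, then reads the slot written by its neighbour.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: thread A writes buf[tid], thread B (its neighbour) reads it.
__global__ void neighbour_read(int *buf, int *out)
{
    int tid = threadIdx.x;
    buf[tid] = tid * 10;                      // global write by this thread
    __syncthreads();                          // barrier for all threads of the block
    out[tid] = buf[(tid + 1) % blockDim.x];   // read a slot written by another thread
}

// Launches ONE block of n threads; returns the number of wrong values, or -1 on a CUDA error.
int count_neighbour_mismatches(int n)
{
    int *buf = nullptr, *out = nullptr;
    if (cudaMalloc(&buf, n * sizeof(int)) != cudaSuccess ||
        cudaMalloc(&out, n * sizeof(int)) != cudaSuccess) {
        cudaFree(buf);
        cudaFree(out);
        return -1;
    }

    neighbour_read<<<1, n>>>(buf, out);

    // cudaMemcpy waits for the kernel and also reports launch/execution errors.
    std::vector<int> host(n);
    cudaError_t err = cudaMemcpy(host.data(), out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(buf);
    cudaFree(out);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (host[i] != ((i + 1) % n) * 10) ++bad;
    return bad;
}

int main()
{
    int bad = count_neighbour_mismatches(256);
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Removing the `__syncthreads()` turns this into a race: a thread may read its neighbour's slot before the neighbour has written it.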

My question:
If the write and read occur within the same thread (so thread A and thread B are the same), do we need a __syncthreads() between the write and the read? I would assume not, but I am a bit worried that the write is asynchronous and takes 400-600 cycles to complete. The read would therefore be issued before that write has completed. Will the memory controller handle this?

E.D_Riedijk · March 24, 2009, 6:11am

I think that the compiler will often not issue another read at all (unless the array is marked volatile) and will just keep the data in a register.

Also keep in mind that a single thread might not run again for 400-600 cycles. There are at least 25 cycles necessary for a warp to complete 1 instruction. So if you have 16 warps active it takes already at least 400 cycles before the same warp needs to run again. (aka latency hiding)

In 2.2beta there is a new instruction to wait until all memory transactions are finished (__threadfence()).

stencil · March 24, 2009, 4:39pm

My understanding of the pipelining is incomplete; however, I was under the impression that the following was the case:

We have a ~25 stage pipeline. Every 4 cycles the pipeline advances 4 stages and one warp has its next instruction issued (the 32 instructions now occupy stages 0, 1, 2, and 3 in the pipeline across the 8 scalar processors of the MP).

If this is the case then it will take 16*4 = 64 cycles before our original warp which did the memory write begins to issue its memory read. Therefore I am still concerned about global memory read after write even if I have 16 warps active. Please let me know if I’ve gone awry here somewhere.

Regardless, I am still interested in a more definitive answer which does not depend on maintaining a particular number of active warps, if anyone knows.

E.D_Riedijk · March 25, 2009, 6:19am

I did indeed mess up the calculations.

If I had to guess, then the result of the read is not certain. That is probably why __threadfence() is introduced in 2.2beta.

Jamie_K · March 25, 2009, 3:27pm

I would guess the opposite. I would expect a single thread to be guaranteed consistency when reading and writing to global memory, otherwise it would be extremely difficult to guarantee correctness of anything without putting __syncthreads() between every access. It would be a very sad state of affairs if single threaded code did not execute correctly without timing considerations.
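A small experiment along these lines, as a sketch rather than a proof: every thread writes to its own global address and immediately reads it back with no barrier in between. The names `self_raw` and `count_self_raw_mismatches` are hypothetical. The pointer is declared `volatile` on the assumption that this keeps the compiler from simply reusing the register value, so that an actual load follows the store.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: write then read the SAME address from the SAME thread.
// 'volatile' asks the compiler to really perform the load instead of
// forwarding the value it just stored from a register.
__global__ void self_raw(volatile int *data, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    data[i] = i + 1;         // global write
    out[i] = data[i] * 2;    // global read of the same location, no __syncthreads()
}

// Returns the number of elements where the read did not see the write, or -1 on a CUDA error.
int count_self_raw_mismatches(int n)
{
    int *data = nullptr, *out = nullptr;
    if (cudaMalloc(&data, n * sizeof(int)) != cudaSuccess ||
        cudaMalloc(&out, n * sizeof(int)) != cudaSuccess) {
        cudaFree(data);
        cudaFree(out);
        return -1;
    }

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    self_raw<<<blocks, threads>>>(data, out, n);

    // cudaMemcpy waits for the kernel and also reports launch/execution errors.
    std::vector<int> host(n);
    cudaError_t err = cudaMemcpy(host.data(), out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(data);
    cudaFree(out);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (host[i] != 2 * (i + 1)) ++bad;
    return bad;
}

int main()
{
    int bad = count_self_raw_mismatches(1 << 20);
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

A clean run only shows that this particular device and toolkit behave as expected; the actual guarantee comes from each thread seeing its own memory operations in program order.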

With multiple threads accessing the same resource, it’s reasonable for the timing and even the order to be undetermined. My understanding of __threadfence() is that it enforces ordering, so all writes that occur before a __threadfence() become apparent to other threads before any of the writes that occur after the __threadfence().
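As a sketch of that ordering use, assuming a toolkit that provides `__threadfence()` and a device with global atomics: each block writes a partial result, fences, then takes a ticket with `atomicInc`. Whichever block draws the last ticket reads all the partial results. The names `fenced_sum_kernel`, `fenced_sum`, and `blocks_done` are hypothetical, and the kernel is launched with one thread per block to keep it short.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__device__ unsigned int blocks_done = 0;   // ticket counter shared by all blocks

// Hypothetical kernel, meant to be launched with ONE thread per block.
__global__ void fenced_sum_kernel(const int *in, volatile int *partial,
                                  int *total, int per_block)
{
    int s = 0;
    for (int k = 0; k < per_block; ++k)
        s += in[blockIdx.x * per_block + k];

    partial[blockIdx.x] = s;   // write that other blocks must be able to see
    __threadfence();           // order the write above before the ticket below

    // atomicInc returns the old value; old == gridDim.x - 1 means every
    // other block has already taken its ticket.
    unsigned int ticket = atomicInc(&blocks_done, gridDim.x);
    if (ticket == gridDim.x - 1) {
        int t = 0;
        for (unsigned int b = 0; b < gridDim.x; ++b)
            t += partial[b];   // volatile reads of the other blocks' results
        *total = t;
        blocks_done = 0;       // reset so the kernel can be launched again
    }
}

// Sums blocks * per_block ones on the device; returns the total, or -1 on a CUDA error.
int fenced_sum(int blocks, int per_block)
{
    int n = blocks * per_block;
    std::vector<int> ones(n, 1);
    int *in = nullptr, *partial = nullptr, *total = nullptr;
    if (cudaMalloc(&in, n * sizeof(int)) != cudaSuccess ||
        cudaMalloc(&partial, blocks * sizeof(int)) != cudaSuccess ||
        cudaMalloc(&total, sizeof(int)) != cudaSuccess) {
        cudaFree(in);
        cudaFree(partial);
        cudaFree(total);
        return -1;
    }
    cudaMemcpy(in, ones.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    fenced_sum_kernel<<<blocks, 1>>>(in, partial, total, per_block);

    // cudaMemcpy waits for the kernel and also reports launch/execution errors.
    int result = -1;
    cudaError_t err = cudaMemcpy(&result, total, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(in);
    cudaFree(partial);
    cudaFree(total);
    return err == cudaSuccess ? result : -1;
}

int main()
{
    int total = fenced_sum(64, 100);
    printf("total: %d\n", total);
    return total == 64 * 100 ? 0 : 1;
}
```

The design point is that no block ever waits on another one. The fence only orders each block's write ahead of its ticket, and the atomic counter identifies the block that can safely read everything.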
