I know the following about global memory usage:
Between threads of different blocks, there is no way to guarantee a global memory write from thread A is correctly read by thread B.
Between threads of the same block, we can guarantee that thread A’s write is correctly read by thread B if we insert a __syncthreads() between the write and the read (p. 28 of the CUDA 2.1 Programming Guide).
If the write and the read occur within the same thread (so thread A and thread B are the same), do we need a __syncthreads() between them? I would assume not, but I am a bit worried that the write is asynchronous and takes 400-600 cycles to complete, so the read would be issued before the write has finished. Will the memory controller handle this?
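For concreteness, here is a minimal sketch of the pattern I mean (the kernel and array names are made up, not from any real code):

```cuda
// Hypothetical kernel: the SAME thread writes a global location
// and then reads it back, with no __syncthreads() in between.
__global__ void writeThenRead(int *data, int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = 42;       // global write (400-600 cycle latency)
    out[i] = data[i];   // read issued by the same thread -- is this safe?
}
```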
I think that the compiler will often not issue another read (unless the array is marked volatile) and will just keep the data in a register.
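Something like this (a sketch of what I mean; the compiler is free to do this, though it is not obliged to):

```cuda
// Hypothetical kernel: the compiler may never issue the second load.
__global__ void regCache(int *data, int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = 1;
    int x = data[i];   // likely reuses the value already held in a register,
                       // so no second global-memory transaction is issued
    out[i] = x;
}
// Declaring the parameter "volatile int *data" would force the compiler
// to actually go back to global memory for the read.
```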
Also keep in mind that a single thread might not run again for 400-600 cycles. A warp needs at least 25 cycles to complete one instruction, so with 16 active warps at least 16 × 25 = 400 cycles pass before the same warp needs to run again (a.k.a. latency hiding).
In 2.2 beta there is a new instruction, __threadfence(), to wait until all memory transactions are finished.
My understanding of the pipelining is incomplete, but I was under the impression that the following is the case:
We have a ~25-stage pipeline. Every 4 cycles the pipeline advances 4 stages and one warp has its next instruction issued (the warp’s 32 instructions then occupy stages 0-3 of the pipeline across the 8 scalar processors of the MP).
If that is the case, then it will take 16 × 4 = 64 cycles before our original warp, which did the memory write, begins to issue its memory read. So I am still concerned about a global-memory read after write even with 16 warps active. Please let me know if I’ve gone awry somewhere.
Regardless, I am still interested in a more definitive answer which does not depend on maintaining a particular number of active warps, if anyone knows.
I would guess the opposite. I would expect a single thread to be guaranteed consistency when reading and writing global memory; otherwise it would be extremely difficult to guarantee the correctness of anything without putting __syncthreads() between every access. It would be a very sad state of affairs if single-threaded code did not execute correctly without timing considerations.
With multiple threads accessing the same resource, it’s reasonable for the timing and even the order to be undetermined. My understanding of __threadfence() is that it enforces ordering: all writes that occur before the fence become apparent to other threads before any writes that occur after it.
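The classic use would be a producer/consumer handshake, something like this sketch (names invented for illustration; this assumes the 2.2 beta __threadfence() semantics as I understand them):

```cuda
// Hypothetical producer: publish a payload, then raise a flag.
// __threadfence() orders the two writes device-wide, so any thread
// in another block that observes flag == 1 should also observe
// payload == 123. Without the fence the flag write could become
// visible first.
__global__ void producer(volatile int *payload, volatile int *flag)
{
    payload[0] = 123;   // write the data
    __threadfence();    // make the payload write visible before the flag
    flag[0] = 1;        // signal readiness to consumers in other blocks
}
```

Note the volatile qualifiers, so the consumer’s polling loop actually re-reads the flag from global memory rather than a register.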