I have a question about coalescing access the global memory. Theoretically according to the program guide, coalesced accessing the consecutive 32 data address by the consecutive 32 threads would greatly reduce the memory transactions and latency, like below
T0 T1 T2…T31
| | | |
D0 D1 D2…D31
Then all 32 access can be coalesced into just one memory transaction. I have two questions generally:
What if consecutive N threads in a group read the same data address, like N=4 below
Will this introduce more overhead and do harm to the efficiency?
What if I assign all the thread blocks to do the same thing, i.e. let consecutive N threads read consecutive N address. In this case, D0 will be read simultaneously by all the T0 from different thread blocks. Will this be a conflict and reduce the access efficiency or just like a broadcasting?
I’m coalescing all threads of a block to read from the array “Data”. Data[threadIdx.x] is the data that threadIdx.x fetched. If I’m also want this thread to access Data[threadIdx.x + blockDim.x] in the program, is the fetch of Data[threadIdx.x + blockDim.x] also automatically coalesced? What I’m worrying about is that it is true for the later access, consecutive threads are reading consecutive addresses, but they might not be at the same time due to the execution of previous part of the program.
I can’t answer question 1. I’ve never tested such access patterns and I’d rather let someone else answer the question.
Concerning question 2, different blocks do not necessarily run simultaneously ; coalescing happens at warp level. These accesses should neither cause conflicts nor will they be broadcast (note : you might benefit from the cache, if you have one on your card, but that would need to be confirmed by someone else).
I’m not sure I understand question 3. If you’re worried about coalescing when different threads of the same block have taken different paths in your code, then you might want to read some documentation about “diverging branches”, because it is really something you want to avoid as much as possible.
As far as coalescing is concerned, this will NOT be a problem. Again, coalescing is a warp-level issue and threads within a warp execute instructions in a lock step fashion, meaning that all the threads within a warp always execute the same instruction at any given time.
The way divergent branches within a warp are handled, is that, if at a given if statement two threads of the same warp take different paths, both paths will be executed sequentially by all the threads within this warp and instructions executed by threads that are not concerned with a given path will have no effect.
For question 2, I know that different blocks do not necessarily run simultaneously, because there might be multiple blocks handled and kept swapping on just one multiprocessor. But what if a data address is to be access simultaneously by all the threads with the same index, e.g. for all the threadInx.x=1, let them access data[1] in global memory at the same time, whether they are in different blocks of one multiprocessor, or even on different multiprocessors.
For question 3, I got your point that as long as there is no branching within a warp, all the operation, like read access, calculation, and write back should all be finished in lock-step fashion. Then if all threads within a warp are working like this, is my understanding correct that no synchronization is needed within a warp after execution? Synchronizations are possibly only needed between warps when you assign all warps of a block do the same thing.
BTW: what do you mean by “instructions executed by threads that are not concerned with a given path will have no effect.”
As far as I know, having different blocks read from the same memory location simultaneously should not be a problem.
Concerning question 3 bis, threads within a warp will always execute instructions in a lock-step fashion, even if there are divergent branches. Synchronization within a warp is not necessary with one exception. I read yesterday on this forum (I can’t remember the name of the post) that if one thread writes to a memory location and later on reads from that same memory location, recent versions of the compiler might optimize the code and keep the value in register to avoid a global memory transaction. In this particular case, only that thread will have the correct value for that memory location. This behavior can be prevented by declaring the variable to be “volatile”.
What I meant by “instructions executed by threads that are not concerned with a given path will have no effect.” is the following :
If you are doing something like this (awful example) :
then even though only the first half of the first warp is concerned with the instruction within the if statement, all the threads within that first warp will execute the instruction but only the first half will actually write to the memory location. The only consequence is that within a warp, threads with “shorter” paths cannot be ahead of other threads. You can’t expect either to have half of you warp execute one branch while the other one is executing another branch. This is why divergent branches are to be avoided as much as possible, because you are basically wasting cycles.