I have a question about coalesced access to global memory. According to the programming guide, having 32 consecutive threads access 32 consecutive data addresses greatly reduces the number of memory transactions and the latency, like below:
T0 T1 T2…T31
| | | |
D0 D1 D2…D31
Then all 32 accesses can be coalesced into a single memory transaction. I have two questions generally:
- What if N consecutive threads in a group read the same data address, e.g. N = 4 below:
T0 T1 T2 T3 T4 T5 T6 T7 T8…T28 T29 T30 T31
|  |  |  |  |  |  |  |  |    |   |   |   |
D0 D0 D0 D0 D1 D1 D1 D1 D2…D7  D7  D7  D7
Will this introduce extra overhead and hurt efficiency?
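To make the pattern concrete, here is a minimal sketch of the kind of kernel I mean (the names `Data`, `out`, and `readGrouped` are just placeholders):

```cuda
// Sketch: each group of N = 4 consecutive threads loads the same element,
// so the 32 threads of a warp touch only 8 distinct addresses (D0..D7).
__global__ void readGrouped(const float *Data, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // threadIdx.x / 4: T0..T3 -> D0, T4..T7 -> D1, ..., T28..T31 -> D7
    float v = Data[threadIdx.x / 4];
    out[tid] = v;
}
```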
- What if I make all the thread blocks do the same thing, i.e. in every block the consecutive N threads read the same consecutive N addresses? In that case D0 is read simultaneously by the T0 of every block. Is this a conflict that reduces access efficiency, or does it behave like a broadcast?
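Again as a sketch of what I mean (placeholder names), every block would index `Data` with `threadIdx.x` only, so all blocks read the identical addresses:

```cuda
// Sketch: every block reads the same range Data[0..blockDim.x-1],
// so T0 of every block loads D0 at (roughly) the same time.
__global__ void readSameInEveryBlock(const float *Data, float *out)
{
    float v = Data[threadIdx.x];  // identical address across blocks
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}
```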
I'm coalescing all threads of a block to read from the array "Data", so Data[threadIdx.x] is the element fetched by thread threadIdx.x. If I also want each thread to access Data[threadIdx.x + blockDim.x] later in the program, is that second fetch also automatically coalesced? What worries me is that, although in the later access consecutive threads still read consecutive addresses, they might not issue the reads at the same time because of the execution of the earlier part of the program.
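Here is a stripped-down version of that situation (the intervening work and all names are hypothetical):

```cuda
// Sketch: two strided reads from the same block, separated by other work.
__global__ void readTwice(const float *Data, float *out)
{
    float a = Data[threadIdx.x];               // first read, coalesced

    // ...other computation here; warps may drift apart in time...

    float b = Data[threadIdx.x + blockDim.x];  // is this still coalesced?
    out[threadIdx.x] = a + b;
}
```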