I'm looking at a situation that requires me to understand carefully how global memory coalescing works. Let's say, for brevity, that a warp is eight threads and that eight consecutive words can be read in one cache line. Now, I'll provide two cases:
First case: a warp is instructed to read all data from the following array (+), process it, and then write results to a similarly shaped array (-):
++++++++ ++++++++ ++++++++ ++++++++ ++++++++ ++++++++
    |        |        |        |        |        |
    V        V        V        V        V        V
[           Perform some work on the data            ]
    |        |        |        |        |        |
    V        V        V        V        V        V
-------- -------- -------- -------- -------- --------
That's six reads for each thread, and each thread will perform whatever operations it's doing on the data six times.
Second case: only about half the slots of the arrays to be read and later written hold relevant data. The warp is given instructions to read just those specific elements:
0 1 3 6 7 9 10 13
14 18 19 22 25 27 29 32
33 34 37 39 40 43 45 46
In this manner, the warp can read the relevant data from the array, perform only half as many operations (as only half the data is relevant), and then write its results back. Each thread would perform only three operations over the three pieces of data it reads.
My question is this: I've certainly enriched the data on which the threads operate, since threads are no longer wasting time on data they don't need to see. But will the total effort of reading and writing in fact be the same in both cases? I'm in trouble if the effort of reading and writing would be GREATER in the second case than in the first, but if it's only the same then I'm fine.
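To make the comparison concrete, here is a rough counting sketch in Python (not CUDA, and not a statement about what any particular GPU does). It uses the toy numbers from this post: an eight-thread warp and eight-word cache lines. The helper names `lines_touched` and `count_line_fetches` are made up for illustration, and the model assumes that every warp-wide read fetches each cache line it touches exactly once, with no reuse between reads. Real hardware uses 32-thread warps, and caching may change the picture.

```python
WORDS_PER_LINE = 8  # toy cache-line width assumed in this post


def lines_touched(indices, words_per_line=WORDS_PER_LINE):
    """Return the set of cache-line numbers hit by one warp-wide read."""
    return {i // words_per_line for i in indices}


def count_line_fetches(warp_reads):
    """Sum the lines touched by each warp read, assuming no reuse between reads."""
    return sum(len(lines_touched(read)) for read in warp_reads)


# Case 1: six dense reads, each covering eight consecutive words.
dense_reads = [list(range(k * 8, k * 8 + 8)) for k in range(6)]

# Case 2: three sparse reads, using the indices listed in this post.
sparse_reads = [
    [0, 1, 3, 6, 7, 9, 10, 13],
    [14, 18, 19, 22, 25, 27, 29, 32],
    [33, 34, 37, 39, 40, 43, 45, 46],
]

dense_total = count_line_fetches(dense_reads)
sparse_total = count_line_fetches(sparse_reads)

# Distinct lines across all sparse reads, i.e. the count if repeats were free.
sparse_distinct = len(set().union(*(lines_touched(r) for r in sparse_reads)))

print("dense line fetches: ", dense_total)
print("sparse line fetches:", sparse_total)
print("sparse distinct lines:", sparse_distinct)
```

Under this toy model the dense case costs 6 line fetches, while the sparse case costs 8 (2 + 4 + 2), even though it touches the same 6 distinct lines. So part of what I'm asking is whether those repeated lines would be served from cache or fetched again.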