I have been optimizing my application. Now I have come to the subject of memory coalescing, but I am still not very clear on the concept.
In my application, I have two char arrays in global memory, gX and gY. Each thread needs to read one element from each of them (gX[x] & gY[y]), and neither x nor y is the same as in any other thread. It is hard to explain exactly what x and y are. They are basically indices of elements along an anti-diagonal of a 2D matrix which is mapped into a 1D array, so the values of x & y depend on the case (which anti-diagonal, and the location along it).
How can I tell whether I have a non-coalesced access problem? If I do, how should I fix it?
The criteria for coalescing are well summarized in the CUDA Programming Guide 2.0, p. 53. To relax the coalescing requirements on reads, bind your input array to a texture (texture reads are cached). To verify whether your accesses are coalesced, use the CUDA Visual Profiler.
First of all, character arrays inherently give poor access timings, because at the architectural level loads from shared memory are done as 32, 64, 96, or 128 bits and loads from global memory as 32, 64, or 128 bits.
Secondly, whether the reads can be coalesced depends on the values of x & y: the pattern they generate on a per-thread basis, and how that pattern lines up at the thread block level.
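To make the "pattern on a per-thread basis" concrete, here is a small host-side C++ sketch. It rests on assumptions the original post does not confirm: that the matrix is stored row-major, and that thread t handles the t-th element of an anti-diagonal. The function name antiDiagonalIndices is made up for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (assumes row-major storage): returns the 1D indices of
// the elements on anti-diagonal d (row + col == d) of a rows x cols matrix,
// ordered by increasing row. Entry t would be the index used by thread t.
std::vector<std::size_t> antiDiagonalIndices(std::size_t rows,
                                             std::size_t cols,
                                             std::size_t d) {
    std::vector<std::size_t> indices;
    if (rows == 0 || cols == 0) return indices;

    // Clamp the row range so that col = d - row stays inside [0, cols).
    const std::size_t firstRow = (d >= cols) ? d - cols + 1 : 0;
    const std::size_t lastRow = (d < rows) ? d : rows - 1;

    for (std::size_t row = firstRow; row <= lastRow; ++row) {
        const std::size_t col = d - row;
        indices.push_back(row * cols + col);
    }
    return indices;
}
```

Under these assumptions, neighbouring threads end up (cols - 1) elements apart rather than next to each other, which is the kind of strided pattern that does not coalesce. If your mapping is different, the stride will be different too, so print the indices for your own case.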
Here are the general rules of thumb for global memory access:
Each thread must load from an address such that addr = ((threadId % 16) * fetchsize) (mod fetchsize * 16) holds true. Fetchsize must be 4, 8, or 16 bytes.
Not all threads have to participate in the load, but the formula above must still hold for the ones that do. For example, if you have conditional logic such that only threads 2, 5, and 7 need to load 4 bytes of data, thread #2 must load from addr = 2 * 4 (mod 64), thread #5 from addr = 5 * 4 (mod 64), and thread #7 from addr = 7 * 4 (mod 64).
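If you want to try the rule on your own index pattern before running the profiler, something like the following host-side C++ sketch could help. It is only a model of the rule of thumb above, not device code. The Access struct and isHalfWarpCoalesced are hypothetical names, and the same-segment check is an extra assumption on my part (the formula above only constrains the address modulo fetchsize * 16).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One entry per thread of a half-warp (hypothetical helper type).
struct Access {
    bool active;         // false if the thread skips the load
    std::uint64_t addr;  // byte address the thread loads from
};

// Returns true if every active thread k loads from an address congruent to
// k * fetchSize modulo (fetchSize * 16), with fetchSize of 4, 8, or 16 bytes.
bool isHalfWarpCoalesced(const std::vector<Access>& halfWarp,
                         std::uint64_t fetchSize) {
    assert(halfWarp.size() <= 16);  // a half-warp has at most 16 threads
    if (fetchSize != 4 && fetchSize != 8 && fetchSize != 16) return false;

    const std::uint64_t segment = fetchSize * 16;
    bool haveBase = false;
    std::uint64_t base = 0;

    for (std::size_t k = 0; k < halfWarp.size(); ++k) {
        if (!halfWarp[k].active) continue;  // non-participating threads are fine
        const std::uint64_t addr = halfWarp[k].addr;
        if (addr % segment != k * fetchSize) return false;

        // Extra assumption: all active threads hit the same aligned segment.
        const std::uint64_t thisBase = addr - addr % segment;
        if (haveBase && thisBase != base) return false;
        base = thisBase;
        haveBase = true;
    }
    return true;
}
```

Feeding it the threads 2, 5, 7 example (addresses 8, 20, and 28 with a fetch size of 4) passes the check. A fetch size of 1, which is what a plain char load amounts to, is rejected outright, and that is the first problem with the gX/gY reads.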
Shared memory works as follows:
1. Each thread in a half warp must load from an address such that addr % (fetchsize * 16) is unique.
2. Rule #1 can be broken in the case where 2 threads need to fetch the exact same address, i.e. thread #0 & thread #1 read address 0, thread #2 and thread #3 read address 4, etc.
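The shared memory rules can be modelled on the host in the same spirit. This is a minimal C++ sketch of the two rules exactly as stated above, not a description of what the hardware does internally; the name isConflictFree is made up for the example.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Returns true if the half-warp's shared memory addresses satisfy the rules
// above: addr % (fetchSize * 16) is unique per thread, except that threads
// reading the exact same address are allowed to share a slot.
bool isConflictFree(const std::vector<std::uint64_t>& addrs,
                    std::uint64_t fetchSize) {
    if (fetchSize == 0) return false;  // avoid a modulo by zero
    const std::uint64_t span = fetchSize * 16;

    // key: addr % span, value: the full address that first claimed that slot
    std::map<std::uint64_t, std::uint64_t> seen;
    for (std::uint64_t addr : addrs) {
        auto result = seen.emplace(addr % span, addr);
        const bool slotTaken = !result.second;
        if (slotTaken && result.first->second != addr) {
            return false;  // two different addresses map to the same slot
        }
    }
    return true;
}
```

With a fetch size of 4, addresses 0, 4, 8, ..., 60 pass, and so does the pairwise pattern 0, 0, 4, 4, .... Addresses 0 and 64 in the same half warp fail, because both reduce to slot 0.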
You may also want to consider using textures or constants, as they place no constraints on the access pattern, although you still need to think about locality of reference to get good performance out of them.
Hope this helps a bit. The diagrams in the CUDA Programming manual definitely take a bit of deciphering and experimentation.