Non-coalesced problem?

casybaby · September 23, 2008, 11:37pm

Hi,

I have been optimizing my application and have now reached the subject of memory coalescing, but I am still not very clear about the concept.

In my application, I have two char arrays in global memory, gX and gY. Each thread needs to read one element from each of them (gX[x] and gY[y]), and no two threads share the same x or y. It is hard to explain exactly what x and y are; they are basically the indices of elements along an anti-diagonal of a 2D matrix that is mapped into a 1D array. So the values of x and y depend on the case (which anti-diagonal, and the position along it).

How can I tell whether I have a non-coalesced access problem? If I do, how should I fix it?

Thank you so much.

Casy

liv · September 26, 2008, 3:47pm

The criteria for coalescing are well summarized in the CUDA Programming Guide 2.0, p. 53. To relax the coalescing requirements for reads, bind your input array to a texture (texture reads are cached). To verify whether your accesses are coalesced, use the CUDA Visual Profiler.

pstach · September 27, 2008, 7:30pm

Here is a less RTFM response than the other guy.

First of all, character arrays inherently give poor access timings, because at the architectural level loads from shared memory are done as 32, 64, 96, or 128 bits and loads from global memory as 32, 64, or 128 bits.
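One way to read the point above: since the narrowest load is 32 bits, four chars can travel in a single word and be unpacked afterwards. The following is only a minimal host-side C++ sketch of that packing idea, not device code; `packWord`, `unpackByte`, and `packArray` are hypothetical names, and the low-byte-first ordering is an assumption chosen for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack four consecutive 8-bit elements into one 32-bit word.
// Assumption for this sketch: element 0 goes into the lowest byte.
inline std::uint32_t packWord(const unsigned char* p) {
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Extract element k (0..3) from a packed word.
inline unsigned char unpackByte(std::uint32_t word, unsigned k) {
    assert(k < 4);
    return static_cast<unsigned char>((word >> (8 * k)) & 0xFFu);
}

// Hypothetical helper: repack a char array into 32-bit words.
// The length must be a multiple of 4 (pad the input if it is not).
std::vector<std::uint32_t> packArray(const std::vector<unsigned char>& bytes) {
    assert(bytes.size() % 4 == 0);
    std::vector<std::uint32_t> words(bytes.size() / 4);
    for (std::size_t i = 0; i < words.size(); ++i) {
        words[i] = packWord(&bytes[4 * i]);
    }
    return words;
}
```

With data packed this way, one 32-bit read yields four elements, and `unpackByte` picks out the one a given thread needs. Whether this helps in practice depends on whether neighbouring threads actually want neighbouring elements.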

Secondly, whether the reads can be coalesced is determined by the values of x and y, the pattern they generate per thread, and how that pattern lines up across the thread block.
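To make the "pattern per thread" concrete, here is a small host-side C++ sketch of one possible anti-diagonal mapping. The thread does not say how the 2D matrix is laid out, so everything here is an assumption: an n x n matrix stored row-major, anti-diagonal d holding the cells with row + col = d, and thread k taking the k-th cell from the top. `antiDiagLength` and `antiDiagIndex` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Number of cells on anti-diagonal d (0 <= d <= 2n-2) of an n x n matrix.
std::size_t antiDiagLength(std::size_t n, std::size_t d) {
    assert(n > 0 && d <= 2 * n - 2);
    return d < n ? d + 1 : 2 * n - 1 - d;
}

// Row-major linear index of the k-th cell on anti-diagonal d,
// counting from the topmost cell of that diagonal.
std::size_t antiDiagIndex(std::size_t n, std::size_t d, std::size_t k) {
    assert(k < antiDiagLength(n, d));
    const std::size_t firstRow = d < n ? 0 : d - (n - 1);
    const std::size_t row = firstRow + k;
    const std::size_t col = d - row;  // cells on the diagonal satisfy row + col == d
    return row * n + col;
}
```

Under these assumptions, thread k and thread k + 1 land n - 1 elements apart (for n = 4 and d = 2 the indices are 2, 5, 8), so neighbouring threads do not touch neighbouring addresses. A different layout, for example storing each anti-diagonal contiguously, would give a different pattern.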

Here are the general rules of thumb for global memory access:

1. Each thread must load from an address such that addr = (threadId % 16) * fetchsize (mod fetchsize * 16) holds. Fetchsize must be 4, 8, or 16 bytes.
2. Not all threads have to participate in the load, but the formula above must hold for those that do. For example, if you have conditional logic such that only threads 2, 5, and 7 need to load 4 bytes of data, thread #2 must load from addr = 2 * 4 (mod 64), thread #5 from addr = 5 * 4 (mod 64), and thread #7 from addr = 7 * 4 (mod 64).
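The two rules above can be written down as a small host-side check. This is only a C++ sketch that encodes the formula exactly as stated in this post; it is not a model of the hardware and does not check anything beyond that formula. `followsHalfWarpRule` and the `kInactive` sentinel are hypothetical names introduced for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sentinel marking a thread that does not take part in the load.
const std::int64_t kInactive = -1;

// addr[t] is the byte address loaded by thread t of a half-warp (16 entries),
// or kInactive if thread t skips the load.
// Returns true if every participating thread satisfies
//   addr % (fetchSize * 16) == t * fetchSize.
bool followsHalfWarpRule(const std::vector<std::int64_t>& addr,
                         std::int64_t fetchSize) {
    assert(fetchSize == 4 || fetchSize == 8 || fetchSize == 16);
    assert(addr.size() == 16);
    const std::int64_t span = fetchSize * 16;
    for (std::size_t t = 0; t < addr.size(); ++t) {
        if (addr[t] == kInactive) continue;  // rule 2: idle threads are fine
        if (addr[t] % span != static_cast<std::int64_t>(t) * fetchSize) {
            return false;
        }
    }
    return true;
}
```

For instance, the pattern from rule 2 (threads 2, 5, and 7 loading from byte addresses 8, 20, and 28 with a 4-byte fetch) passes the check, while a pattern in which thread t reads byte address t, as a naive one-char-per-thread read would, fails it.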

Shared memory works as follows:

1. Each thread in a half-warp must load from an address such that addr % (fetchsize * 16) is unique.
2. Rule #1 can be broken when two threads need to fetch the exact same address, e.g. thread #0 and thread #1 read address 0, thread #2 and thread #3 read address 4, etc.
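The shared memory rules can be sketched the same way. Again this is a host-side C++ illustration that encodes only the two rules as stated in this post, with hypothetical names (`hasSharedConflict`); it is not a full model of how the hardware resolves conflicts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// addr[t] is the shared-memory byte address read by thread t of a half-warp.
// Returns true if two threads read different addresses that collide under
//   addr % (fetchSize * 16),
// i.e. rule 1 is violated and the same-address exception does not apply.
bool hasSharedConflict(const std::vector<std::int64_t>& addr,
                       std::int64_t fetchSize) {
    assert(fetchSize > 0);
    assert(addr.size() == 16);
    const std::int64_t span = fetchSize * 16;
    for (std::size_t i = 0; i < addr.size(); ++i) {
        for (std::size_t j = i + 1; j < addr.size(); ++j) {
            const bool sameSlot = (addr[i] % span) == (addr[j] % span);
            const bool sameAddress = addr[i] == addr[j];
            if (sameSlot && !sameAddress) return true;
        }
    }
    return false;
}
```

With a 4-byte fetch, thread t reading address 4 * t gives no conflict, and neither does the paired pattern from the exception (threads 0 and 1 at address 0, threads 2 and 3 at address 4, and so on). A stride of 8 bytes does conflict under this rule, because threads 0 and 8 read addresses 0 and 64, which are different addresses with the same remainder modulo 64.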

Also, you may want to consider using textures or constant memory, as they place no constraints on the access pattern, although locality of reference still needs to be considered to get good performance out of them.

Hope this helps a bit. The diagrams in the CUDA programming manual definitely take a bit of deciphering and experimentation.

-Patrick
