Memory coalescing and multiple arrays

Jamie_K · March 19, 2009, 9:27pm

That is for loading single elements from memory. A float or int must be 4-byte aligned, a short must be 2-byte aligned, and so on; otherwise the compiler will have to generate code to jiggle the bits to get the value into a register correctly.

The discussion in this thread is about coalescing, where loads from multiple threads fuse into a single memory transaction for improved memory bandwidth. The address loaded by thread 0 must be aligned to 16 elements, so for floats that is 64 bytes. For larger types (double, float4, etc.) the alignment requirements for coalescing are bigger.
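As a rough illustration of the sizes involved, here is a minimal host-side C++ sketch of the "16 elements" rule described above. The names `kHalfWarp` and `coalescingAlignment` are made up for this example and are not part of CUDA, and the figure of 16 threads assumes the half-warp coalescing rules of the hardware this thread is about.

```cpp
#include <cstddef>

// Assumption: coalescing is decided per half-warp of 16 threads, as on the
// hardware discussed in this thread.
constexpr std::size_t kHalfWarp = 16;

// Hypothetical helper: the byte alignment needed for the address read by
// thread 0 of a half-warp, i.e. 16 * sizeof(element).
template <typename T>
constexpr std::size_t coalescingAlignment() {
    return kHalfWarp * sizeof(T);
}

// Compile-time checks of the numbers quoted in the post above.
static_assert(coalescingAlignment<float>() == 64, "16 floats span 64 bytes");
static_assert(coalescingAlignment<double>() == 128, "16 doubles span 128 bytes");
```

So a float array wants a 64-byte boundary and a double array a 128-byte one, which is what "bigger for larger types" amounts to under this rule.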

smokyboy · March 20, 2009, 7:36am

I am not sure I understand what you mean - the following code:

__device__ type device[32];

type data = device[tid];

is about multiple threads reading sequential elements from an array, which should be coalesced into a single memory transaction. So this is exactly the case discussed by this thread.

However, I already found the answer in the old version of the guide, as someone suggested, where it explicitly says that the base address must be aligned to 16*sizeof(element). Why they removed this in the new version is beyond me, especially since this limitation still holds.
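To make the 16*sizeof(element) rule concrete for the 32-element read above, here is a small host-side C++ sketch. It is only an illustration: `everyHalfWarpStartsAligned` is a hypothetical helper, not a CUDA function, and it assumes thread `tid` reads element `tid` and that coalescing is checked per half-warp of 16 threads.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical host-side check, not a CUDA API. Given the byte address of
// element 0, the element size, and the number of threads reading array[tid],
// report whether the first thread of every half-warp (16 threads) reads from
// an address that is a multiple of 16 * elemSize.
inline bool everyHalfWarpStartsAligned(std::uintptr_t baseAddress,
                                       std::size_t elemSize,
                                       std::size_t numThreads) {
    const std::size_t halfWarp = 16;
    const std::size_t required = halfWarp * elemSize;
    for (std::size_t tid = 0; tid < numThreads; tid += halfWarp) {
        const std::uintptr_t address = baseAddress + tid * elemSize;
        if (address % required != 0) {
            return false;  // this half-warp starts off a boundary
        }
    }
    return true;
}
```

With 4-byte elements and 32 threads, a base address of 0x1000 passes (both half-warps start on 64-byte boundaries), while 0x1010 fails even though every individual element is still 4-byte aligned.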

Nvidia guys should really consider supplementing the guide because this is the only true source of information for programmers besides the code examples, which don’t explain many things.

Tiberius · March 20, 2009, 7:07pm

I finally realized how to see coalescing through the visual profiler. I was shocked to see that it reports ~1,600,000 uncoalesced global reads versus ~400,000 coalesced global reads per iteration of the adders (the numbers are similar for the memory-aligned and tiled versions, with fewer total reads for tiled).

Any ideas why?

Tiberius · March 20, 2009, 7:29pm

Well, I got it all coalescing.

In the code where I calculate the appropriate array dimensions for coalescing, I had:

yAdjusted = int(ceil( float(requestedYDimension)/4.0f ))*4;
zAdjusted = int(ceil( float(requestedZDimension)/4.0f ))*4;

because I was trying to get the y and z dimensions to fall on 16-byte boundaries. The following change causes everything to coalesce. It seems that they need to fall on 64-byte boundaries…

yAdjusted = int(ceil( float(requestedYDimension)/16.0f ))*16;
zAdjusted = int(ceil( float(requestedZDimension)/16.0f ))*16;

I am once again a bit confused. Is it that jumps of 16 elements will coalesce (i.e. 64 bytes for a float array, 128 bytes for a double array)?

btw… My new speedup times make much more sense. I am seeing 15.4x for the memory aligned version and 35.5x for the tiled version.
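For anyone following along, the effect of padding to 4 versus 16 elements can be sketched on the host in plain C++. This is an illustration, not the code from the post: `roundUpToMultiple` and `rowStartBytes` are hypothetical helpers, and it assumes the padded dimension is the row length of a row-major float array whose base address is itself 64-byte aligned.

```cpp
#include <cstddef>

// Round n up to the next multiple of `multiple` with integer arithmetic.
// For non-negative n small enough to be exact in a float, this matches
// int(ceil(float(n) / multiple)) * multiple.
constexpr int roundUpToMultiple(int n, int multiple) {
    return ((n + multiple - 1) / multiple) * multiple;
}

// Byte offset of the start of row `row`, assuming (for illustration only)
// a row-major layout where each row holds paddedRowLength elements.
constexpr std::size_t rowStartBytes(int row, int paddedRowLength,
                                    std::size_t elemSize) {
    return static_cast<std::size_t>(row) * paddedRowLength * elemSize;
}

// A requested length of 10 floats padded to a multiple of 4 gives 12
// elements, so row 1 starts at byte 48, which is not on a 64-byte boundary.
static_assert(roundUpToMultiple(10, 4) == 12, "pad to 4 elements");
static_assert(rowStartBytes(1, 12, sizeof(float)) % 64 != 0, "48 bytes");

// Padded to a multiple of 16 it becomes 16 elements, so every row starts on
// a 64-byte boundary.
static_assert(roundUpToMultiple(10, 16) == 16, "pad to 16 elements");
static_assert(rowStartBytes(1, 16, sizeof(float)) % 64 == 0, "64 bytes");
```

Under these assumptions the same helper with a double array would give row starts on 128-byte boundaries, since 16 doubles occupy 128 bytes.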
