Memory coalescing and multiple arrays

That is for loading single elements from memory. A float or int must be 4-byte aligned, a short must be 2-byte aligned, and so on; otherwise the compiler has to generate extra code to shuffle the bytes into a register correctly.

The discussion in this thread is about coalescing, where loads from multiple threads fuse into a single memory transaction for improved memory bandwidth. The address loaded by thread 0 must be aligned to 16 elements, so for floats that is 64 bytes. For larger types (double, float4, etc.) the alignment requirements for coalescing are bigger.
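To make the arithmetic concrete, here is a minimal host-side sketch of that rule. It assumes a half-warp of 16 threads and the 16*sizeof(element) alignment described in this thread; the names kHalfWarp, coalescingAlignmentBytes and startsOnCoalescingBoundary are made up for illustration and are not part of any CUDA API.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 16 threads (a half-warp) take part in one coalesced load.
constexpr std::size_t kHalfWarp = 16;

// Bytes spanned when 16 consecutive threads each load one element of type T.
// Per the rule discussed here, this is also the required alignment.
template <typename T>
constexpr std::size_t coalescingAlignmentBytes() {
    return kHalfWarp * sizeof(T);
}

// True if the byte address read by thread 0 is a multiple of that alignment.
template <typename T>
constexpr bool startsOnCoalescingBoundary(std::uintptr_t byteAddress) {
    return byteAddress % coalescingAlignmentBytes<T>() == 0;
}
```

With these definitions, coalescingAlignmentBytes<float>() is 64 and coalescingAlignmentBytes<double>() is 128, so a float load starting at byte 16 is correctly aligned for a single element but not for coalescing.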

I am not sure I understand what you mean - the following code:

__device__ type device[32];

type data = device[tid];

is about multiple threads reading sequential elements from an array, which should be coalesced into a single memory transaction. So this is exactly the case discussed by this thread.

However, I already found the answer in the old version of the guide, as someone suggested, where it explicitly says that the base address must be aligned to 16*sizeof(element). Why they removed this in the new version is beyond me, especially since the limitation still holds.

The NVIDIA guys should really consider supplementing the guide, because it is the only true source of information for programmers besides the code examples, which leave many things unexplained.

I finally realized how to see coalescing through the visual profiler. I was shocked to see that it reports ~1,600,000 uncoalesced global reads versus ~400,000 coalesced global reads per iteration of the adders (similar numbers for the memory-aligned and tiled versions, with fewer total reads for tiled).

Any ideas why?

Well, I got it all coalescing.

In the code where I calculate the appropriate array dimensions for coalescing, I had:

yAdjusted = int(ceil( float(requestedYDimension)/4.0f ))*4;

zAdjusted = int(ceil( float(requestedZDimension)/4.0f ))*4;

because I was trying to get the y and z dimensions to fall on 16-byte boundaries. The following change causes everything to coalesce. It seems that they need to fall on 64-byte boundaries…

yAdjusted = int(ceil( float(requestedYDimension)/16.0f ))*16;

zAdjusted = int(ceil( float(requestedZDimension)/16.0f ))*16;
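For what it's worth, the same rounding can be done with integers only, which avoids the float round trip. This is just a sketch: roundUpToMultiple and rowStartBytes are hypothetical helper names, and the row-start check assumes a row-major float array in which the padded dimension is the contiguous one.

```cpp
#include <cstddef>

// Round n up to the next multiple of `multiple` (multiple > 0, n >= 0),
// using integer arithmetic instead of ceil() on floats.
constexpr int roundUpToMultiple(int n, int multiple) {
    return ((n + multiple - 1) / multiple) * multiple;
}

// Byte offset of the first element of row `row`, assuming a row-major
// float array whose rows are paddedRowLength elements long.
constexpr std::size_t rowStartBytes(int row, int paddedRowLength) {
    return static_cast<std::size_t>(row) * paddedRowLength * sizeof(float);
}
```

For a requested dimension of 100, rounding to a multiple of 4 leaves it at 100, so row 1 starts at byte 400, which is not a multiple of 64. Rounding to a multiple of 16 gives 112, so row 1 starts at byte 448 = 7 * 64, which matches the 64-byte observation above.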

I am once again a bit confused. Is it that jumps of 16 elements will coalesce (i.e. 64 bytes for a float array, 128 bytes for a double array)?

btw… My new speedup times make much more sense. I am seeing 15.4x for the memory aligned version and 35.5x for the tiled version.