Memory coalescing and multiple arrays

I am having some difficulty wrapping my head around how to achieve the best memory coalescing possible. My broader posts have not received any responses, so I decided to try to break it down into smaller chunks and see if that helps…

Consider an application that adds four elements from two different arrays:

float A[SOME_SIZE];

float B[SOME_SIZE];

float C[SOME_SIZE];

for ( int i=0; i<SOME_SIZE-1; ++i )

{

	C[i] = A[i] + A[i+1] + B[i] + B[i+1];

}

When translating this to GPU code, I understand that I should transfer from global memory to shared memory to allow the “i+1” reads to coalesce. I am uncertain whether there is a difference between reading from array A, B, or C.

For example, I recall that you can achieve memory coalescing by reading a multiple of 64K ahead of your current position. Does this same concept apply when reading from multiple arrays? Meaning: should the array sizes of A, B, and C be multiples of 64K in order to achieve memory coalescing when reading/writing A[i], B[i], and C[i]?

Please let me know if I have muddled my question. I will happily try to revise if I’ve made it difficult to understand…

Thanks!

It sounds like you’re in a similar place of understanding as I was once - and it’s ever so slightly off :)

It is most helpful to think of coalescing among groups of 32 threads. These 32 adjacent threads should read adjacent memory addresses. When a read/write like this is requested for a contiguous piece of memory, the access will be much more efficient. Note that the next 32 threads can access a block from a totally different place in memory, as long as that block is also contiguous.

That is the basic idea. The description may not be 100% accurate though.

If you have a GTX 200 series card, this is basically it, for coalescing. If you have a GeForce 8 or 9 series card, there’s some extra stuff involved: Among the 32 threads, thread 0 must access the first address in the block, and the block must start at an address that is aligned to a multiple of 64 bytes – or 16 floats. Thread 1 must read baseAddress+1, thread 2 baseAddress+2 and so on. That can sometimes be the complex part.

In short, the rules for full coalescing on the old architecture:

baseAddress for threadGroup % (16 * sizeof(type)) = 0 – i.e. the group’s block starts on a 64-byte boundary for a 4-byte type

(threadIndex % 16) = ((baseArrayIndex + threadArrayAccessIndex) % 16) – i.e. thread k of the group accesses element baseIndex + k. Note, I’m using array indexes, and assuming a 4-byte type, to show the fit to threads more easily.
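To make those rules concrete, here is a minimal sketch of my own (untested) of an access pattern that satisfies them for 4-byte floats, assuming the pointers come straight from cudaMalloc and are therefore aligned:

// My own illustration, not from the guide. Each half-warp of 16 threads reads
// 16 consecutive floats whose first element sits on a 64-byte boundary, so the
// 16 reads become one transaction on compute 1.0/1.1 hardware.
__global__ void CopyKernel( const float *in, float *out )
{
	// Assumes blockDim.x is a multiple of 16, so every half-warp's base index
	// is a multiple of 16 elements = 64 bytes from the start of the allocation.
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	out[i] = in[i];     // thread k of each half-warp touches element base + k

	// By contrast, something like out[i] = in[i + 1]; would shift every
	// half-warp off its 64-byte boundary and break coalescing on old hardware.
}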

So do I understand correctly when I say something like this -

if I have to access more than 1 array in a kernel, and each array is > 128 bytes in size, I cannot do coalesced memory reads.

Each thread itself will read from more than 1 memory location. So even if the adjacent 32 threads are reading from adjacent locations in each array, I still won’t get coalescing in this case because of the multiple reads?

Tiberius’ code from above as an example:

[codebox]float A[SOME_SIZE];

float B[SOME_SIZE];

float C[SOME_SIZE];

for ( int i=0; i<SOME_SIZE-1; ++i )

{

C[i] = A[i] + A[i+1] + B[i] + B[i+1];

}[/codebox]

If you give every i value to a thread, I think that should coalesce on each read (A[i], A[i+1], B[i], B[i+1]), modulo the alignment issues mentioned above. You don’t need to use shared memory to coalesce - reading into registers will work just fine. The reason you would want to use shared memory here is simply because element i for one thread is also element i+1 for its neighbour. So (assuming 512 threads per block), each block would load the 513 values from each of A and B into shared memory (and the 513th read is bound to be uncoalesced in each case!), compute its value for C, and then write that back. Your kernel would look something like this:

[codebox]__global__ void AddKernel( float *A, float *B, float *C ) {

__shared__ float A_s[513], B_s[513];

const int tx = threadIdx.x;

A_s[tx] = A[tx];

B_s[tx] = B[tx];

if( tx == 511 ) {

A_s[512] = A[512];

B_s[512] = B[512];

}

__syncthreads();

C[tx] = A_s[tx] + A_s[tx+1] + B_s[tx] + B_s[tx+1];

}[/codebox]

As ever, beware of bugs - I haven’t even proved this code correct, let alone tried it :)
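A hypothetical host-side launch for that sketch (the names are made up), just to pin down the assumptions: one block of 512 threads, and input arrays of at least 513 floats:

float *d_A, *d_B, *d_C;
cudaMalloc( (void**)&d_A, 513 * sizeof(float) );   // 512 outputs need 513 inputs
cudaMalloc( (void**)&d_B, 513 * sizeof(float) );
cudaMalloc( (void**)&d_C, 512 * sizeof(float) );
// ... fill d_A and d_B with cudaMemcpy ...
AddKernel<<< 1, 512 >>>( d_A, d_B, d_C );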

Actually, in order to prevent the last-element issue (coalesced or not…) you might want to simply launch 513 threads instead of 512, where all 513 read the data into smem and the 513th thread then just returns. That should still be coalesced. You’ll need to watch how this improves/degrades your performance and what it does to the occupancy. I tried it in the past and it was fine for me, as far as I remember…

You had a card which could run 513 threads in a block? Even if it was 256 vs 257, threads are always scheduled by warp - so that extra thread is going to trigger the creation of a whole new warp, only one thread of which will do anything. Perhaps you meant 512 threads to operate on 511 items?
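For reference, a rough, untested sketch of that variant, single-block like the kernel above: all 512 threads load into shared memory, and only the first 511 compute, so no thread needs a second tail read at all.

__global__ void AddKernel511( const float *A, const float *B, float *C )
{
	__shared__ float A_s[512], B_s[512];
	const int tx = threadIdx.x;

	A_s[tx] = A[tx];   // all 512 loads coalesce; no extra tail element needed
	B_s[tx] = B[tx];
	__syncthreads();

	if( tx < 511 )     // 512 threads, 511 outputs
		C[tx] = A_s[tx] + A_s[tx+1] + B_s[tx] + B_s[tx+1];
}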

You’re probably right and I used 257 instead of 256. As for the extra warp, I don’t recall seeing a performance penalty.

No, not at all, because the rules “wrap around”! Thread 0 accesses address 0, but also address 16 and 32 and 64 and …

Thread 1 accesses address 1, while thread 0 accesses 0, and then address 17 while thread 0 accesses 16 … etc.

Or to think of it differently, yes you are right, all arrays must be < 128 bytes, but you can make big arrays with these small arrays!

That’s exactly what’s got me bouncing my head off the table. :-) I have a lot of old architecture cards available to me. I’m able to get some OK speedups, but memory access is clearly swamping the potential performance.

I saw somewhere that jumping in chunks of 64 bytes (although I thought it was 64 KB) allowed memory coalescing, because you were jumping to a new memory block. I was hoping that maybe the individual arrays lay in contiguous blocks of memory, and you needed to size them to a multiple of 64 to coalesce reads between them. This would be an easy change to make for a performance gain. :) (I am currently away from my computers for a week or so, so I can’t just experiment.)

So maybe my original thought was correct. If your three arrays were sized to 130 elements each, would reading A[0], B[0], C[0] not coalesce because the starts of arrays B and C are not on 16-byte boundaries? Or do the starts of arrays always “snap” to 16-byte boundaries?

As a followup: If the first statement is true, do GPUs suffer from similar memory fragmentation issues as traditional RAM?

Coalescing occurs between threads, not between multiple reads of a single thread.

If array A is on a 64-byte boundary, and if you access A[threadIdx.x], then thread 0 will load from A[0], thread 1 will load from A[1], … and thread 15 will load from A[15], which will occur in a single coalesced transaction instead of 16 separate transactions. Then thread 16 (the “0th” thread of the next half-warp) will load from A[16] (which is again on a 64-byte boundary), and so forth up to thread 31 which loads from A[31], and these will also coalesce into a single transaction.

Coalescing occurs when multiple threads are accessing adjacent data and (under the right conditions) the multiple accesses are coalesced into one. Multiple accesses by a single thread cannot coalesce with each other, regardless of alignment or adjacency.

I don’t believe fragmentation is an issue, because it is not common to allocate and deallocate large numbers of small segments from the host. Within the kernel there is no way to allocate memory.

Memory allocations from the host are always aligned to 64 byte (at least) boundaries.

But if you were to “pack” three arrays of 130 elements each, then while A would start on a 64-byte boundary, B would not, so when reading B[threadIdx.x], threads 0 through 15 will not coalesce into a single read as they do for A; it will take 16 separate reads instead.
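To make the offsets concrete, a small host-side sketch of the “packed” case (hypothetical sizes, untested):

// Packing three 130-float arrays into one allocation.
float *base;
cudaMalloc( (void**)&base, 3 * 130 * sizeof(float) );

float *A = base;         // offset    0 bytes - aligned (start of the allocation)
float *B = base + 130;   // offset  520 bytes - not a multiple of 64
float *C = base + 260;   // offset 1040 bytes - not a multiple of 64
// On compute 1.0/1.1 hardware, reads of B[threadIdx.x] and C[threadIdx.x]
// therefore split into separate transactions per half-warp instead of one.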

Sorry… I didn’t separate my thoughts clearly enough. I know that coalescing does not occur between multiple reads in a single thread. I also should have specifically pointed out that I am concerned for performance on the older hardware - meaning I need to be more concerned about reads not on a 64 byte boundary. I am specifically concerned with what happens when the arrays are not allocated to a multiple of 64 bytes by the user.

In the older hardware, my understanding is that reads from an address that is not at the right 64-byte-aligned offset for your thread ID are serialized - even if no other thread is accessing that memory. It is not sufficient that the data simply be adjacent (or adjacent and at unique addresses).

Returning to my previous example of accessing three arrays sized to 130 floating point elements each (and borrowing code from another post):

__global__ void AddKernel( float *A, float *B, float *C ) {

__shared__ float A_s[130], B_s[130];

const int tx = threadIdx.x;

A_s[tx] = A[tx];

B_s[tx] = B[tx];

__syncthreads();

if( tx < 129 )   // the last thread has no element tx+1 in shared memory to read

	C[tx] = A_s[tx] + A_s[tx+1] + B_s[tx] + B_s[tx+1];

}

Reads from A obviously coalesce. No problem there. However, it is less clear for reads from B. It seems that B[0] will be 520 bytes away from A[0] (4 bytes * 130, or 64 bytes * 8.125) - not a whole multiple of 64 bytes. This takes me to the final section of your post.

It sounds like you are saying that if I allocate 3 separate floating-point arrays from the host, the arrays will be padded such that the starts of the arrays will always be a multiple of 64 bytes away from each other, i.e. 130 elements would effectively be rounded to 144 elements and B[0] would be 576 bytes away from A[0] instead of 520. Am I reading your post correctly?

OK, I haven’t read this entire thread but there seems to be much confusion on the old coalescing rules.

  • Do you have a copy of the old CUDA programming guide version 1.0 or 1.1? It explains the old coalescing rules much better than the new one. I can send you a copy if you can’t find it on the net.

  • Any address allocated by cudaMalloc starts on a 256 byte boundary. This is sufficient for coalescing of any type up to size 16 bytes.

  • By far, the most common memory read pattern is

type data = d_array[something + threadIdx.x]

On 1.0 and 1.1 hardware, this will be coalesced if type is 4, 8, or 16 bytes in size and “something” is a multiple of 16. On compute 1.3, this will be done in a single memory transaction which is twice as fast as the 2 you will get when something is not a multiple of 16 :)

  • If, in your code, you truly need to access elements at d_array[something + threadIdx.x + 1], just bind the array to a texture and use tex1Dfetch. It will get full performance on all hardware (see the sketch below).
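Here is a rough sketch of that texture route (untested, using the old texture-reference API; the names are mine):

// Texture reference, declared at file scope.
texture<float, 1, cudaReadModeElementType> texA;

__global__ void ShiftedReadKernel( float *out, int something )
{
	// The "+ 1" would break coalescing as a plain global read on 1.0/1.1
	// hardware; fetched through the texture cache it keeps full performance.
	out[threadIdx.x] = tex1Dfetch( texA, something + threadIdx.x + 1 );
}

// Host side:
//   cudaBindTexture( NULL, texA, d_array, numBytes );
//   ShiftedReadKernel<<< 1, 256 >>>( d_out, 0 );
//   cudaUnbindTexture( texA );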

Yes, that is what I meant. Allocations will always be aligned, but if you chop a single block into “odd” sizes, then only the first chunk will be aligned.

And you are correct, it is not sufficient that the data simply be adjacent.

As for the example code:

__global__ void AddKernel( float *A, float *B, float *C ) {

	...

}

The reads and writes will coalesce if A, B, and C are aligned. Meaning they should be allocated separately, or if they are segments cut from a single allocation, they need to be cut into segments whose length is a multiple of 16 elements, to preserve 64-byte alignment of the segments.
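A minimal host-side sketch of that cutting rule (illustrative numbers, untested): round each segment up to the next multiple of 16 elements before slicing.

const int n       = 130;
const int nPadded = ( (n + 15) / 16 ) * 16;   // 144 elements = 576 bytes

float *base;
cudaMalloc( (void**)&base, 3 * nPadded * sizeof(float) );

float *A = base;                 // offset    0 bytes
float *B = base + nPadded;       // offset  576 bytes - a multiple of 64
float *C = base + 2 * nPadded;   // offset 1152 bytes - a multiple of 64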

Mister Anderson & Jamie K,

Thank you very much for your replies. This has helped to clear up much of my confusion.

I think my next step will be to build a simple application to demonstrate what’s been discussed in this thread. I will use that to play with coalescing.

And since you are on the older hardware, you can use the visual profiler to tell whether your simple application coalesces or not.

Here is the test application I decided to build and play with. On launch, it populates three arrays with random numbers (to avoid working with trivial data like zeroes). It uses these three arrays to perform a contrived computation that must access neighboring memory locations.

I wrote four versions of the computation. The CPU version provides the baseline for comparison. The Unoptimized version is a simple, straightforward implementation. Memory Aligned allocates the arrays in such a way that the step between memory locations (i.e. from (x, y, z) to (x, y+1, z)) is a multiple of 16 bytes. The Tiled implementation launches only y*z threads and each thread steps through from x=0 to the maximum X (the x direction is the longest jump in memory for these arrays). The goal there is to save a memory lookup.

I would appreciate any feedback anyone has to point out other optimizations that I should consider or mistaken ideas I have…

Thanks!

I took a look and it looks like everything should be coalescing. What sort of performance do you get from your implementations? It looks as if the TiledAdder may run faster because you are reusing some of the data instead of loading it each time.

I think you may be able to extend the concept of reusing data loaded from global memory. If I may paraphrase, your computation is basically doing

C[x, y, z] += t*(A[x+1, y, z] - A[x, y, z]) + u*(B[x, y+1, z] - B[x, y, z])

And by keeping A[x+1, y, z] while iterating over x, you save yourself one load from memory. If you were to compute multiple rows (y values) within the loop, you could also decrease the loads from B. If you were to compute say 4 rows at once, you would need 5 loads from B instead of 8.
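If it helps, here is a rough, untested sketch of that register-reuse idea. The names are mine, and I am assuming x is the slowest-varying dimension (index = x*ny*nz + y*nz + z), based on your description of the Tiled version:

__global__ void TiledAdder( const float *A, const float *B, float *C,
                            int nx, int ny, int nz, float t, float u )
{
	// One thread per (y, z) pair; each thread walks the whole x range.
	// Adjacent threads touch adjacent addresses, so reads/writes coalesce
	// (assuming ny*nz is a multiple of 16 so each x-plane stays 64-byte aligned).
	int yz = blockIdx.x * blockDim.x + threadIdx.x;   // flattened (y, z)
	int y  = yz / nz;
	if( y >= ny - 1 ) return;                         // B[x, y+1, z] must exist

	int plane = ny * nz;                              // stride of one step in x
	float a = A[yz];                                  // A[0, y, z]
	for( int x = 0; x < nx - 1; ++x )
	{
		float aNext = A[(x + 1) * plane + yz];        // kept in a register for next x
		C[x * plane + yz] += t * (aNext - a)
		                   + u * (B[x * plane + yz + nz] - B[x * plane + yz]);
		a = aNext;                                    // saves one load of A per x step
	}
}

Computing several y rows per thread, as suggested above, would extend this by keeping a few B values in registers as well.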

I am guessing you are doing a sort of gradient calculation, and you would like to calculate the 3D gradient. You could do this without the alignment problems by putting the data in shared memory, say like (paraphrased, not legal code):

__shared__ float SA[BLOCKSIZE+1];
SA[tid] = A[x, y, z];
__syncthreads();
if (tid < BLOCKSIZE-1) {
	gradient += v*(SA[tid] - SA[tid+1]);
}

The problem with this is that the last element will not have its data loaded correctly and will require special handling. But the extra complexity and performance penalty may be worth it, compared to other ways of getting the z component of the gradient.

Jamie,

Your paraphrase of the computation is definitely correct. The unoptimized version sees about a 9.5x speed up on my machine. The memory aligned version gets about 14x and the tiled version about 15x. The visual profiler indicates 0.667 occupancy for unoptimized, 1 for memory aligned, and 0.667 for tiled.

I will have to give some thought to the multiple rows idea. I will probably try to implement this over the weekend.

Thanks!

The CUDA programming guide states that:

First, the device is capable of reading 32-bit, 64-bit, or 128-bit words from global memory into registers in a single instruction. To have assignments such as:

__device__ type device[32];

type data = device[tid];

compile to a single load instruction, type must be such that sizeof(type) is equal to 4, 8, or 16 and variables of type type must be aligned to sizeof(type) bytes (that is, have their address be a multiple of sizeof(type)).

According to this, for an array of ints the base address must be aligned to 4 bytes (sizeof(int)). How did you come up with 64 bytes?