Array offset and coalesced memory access question?

I’m having a bit of trouble offsetting an array while maintaining coalesced memory access. I’ll give a simple example below:

Take this kernel code for example:

__global__ void add(float *array, float amount)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    array[tid] = array[tid] + amount;
}

The following kernel call produces coalesced memory access:

float *d_array;
cudaMalloc((void**)&d_array, 1025*sizeof(float));
cudaMemset(d_array, 0, 1025*sizeof(float));
add<<<4, 256>>>(d_array, 1);

However for some reason, this does NOT produce coalesced memory access:

float *d_array;
cudaMalloc((void**)&d_array, 1025*sizeof(float));
cudaMemset(d_array, 0, 1025*sizeof(float));
add<<<4, 256>>>(d_array+1, 1);

Does anyone know why? Is there any way to offset the input array while keeping coalesced memory access?

cudaMalloc returns addresses that are 16-word aligned, so accesses to them can coalesce on compute 1.0 hardware. In your second example you deliberately break that alignment, so your kernel is doing unaligned accesses, which work but, as you noticed, are uncoalesced.
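To make it concrete:

float *d_array;
cudaMalloc((void**)&d_array, 1025 * sizeof(float));
// d_array starts on an aligned boundary, so a half-warp reading
// d_array[tid] touches a single aligned 64-byte segment: coalesced.
// d_array + 1 is 4 bytes past that boundary, so the same half-warp's
// 64 bytes now straddle two segments: uncoalesced on 1.x hardware.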

The easy solution… if you can’t pass in aligned memory, at least walk the chunks in aligned steps. You may need an extra final read to “clean up” the misaligned final values.

__global__ void add(float *array, float amount)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid != 0) array[tid-1] = array[tid-1] + amount; // coalesced!
    else array[1023] = array[1023] + amount; // extra cleanup only for thread 0, which catches the "tail" value. Ugly hardwired index here.
}

In practice that "else" needs to be generalized to handle any shift. For example, a shift of 3 would need 3 threads to wrap around and handle the end cases. The cleanup also depends on the length of the array: in the example above, if the array were 1023 elements long, no cleanup would be necessary.
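One way to write that generalization (just a sketch — shift and n are parameters I'm introducing here, and I haven't timed it):

__global__ void addShifted(float *array, float amount, int shift, int n)
{
    // array points `shift` floats past an aligned base; n = values to process;
    // launch with at least n threads in total
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int total = gridDim.x * blockDim.x;

    if (tid >= shift) {
        if (tid - shift < n)
            array[tid - shift] += amount;   // aligned, coalesced walk
    } else {
        int tail = total - shift + tid;     // first `shift` threads wrap around
        if (tail < n)
            array[tail] += amount;          // cleanup for the misaligned tail
    }
}

With shift = 1 and n = 1024 this reduces to the hardwired version above; with n = 1023 the tail test fails and thread 0 just idles.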

Easier to code is to just launch extra threads and let any extra ones at the start and the end sit idle. That amounts to renumbering your threads to match the problem.
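Something like this (again a sketch; offset and n are my names):

__global__ void addOffset(float *base, float amount, int offset, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // thread tid only ever touches base[tid], so every access stays aligned;
    // threads outside the window [offset, offset + n) simply do nothing
    if (tid >= offset && tid < offset + n)
        base[tid] += amount;
}

// launch with enough threads to cover offset + n elements, e.g. for the
// example above: addOffset<<<(1 + 1024 + 255) / 256, 256>>>(d_array, 1, 1, 1024);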

Ahh, thanks for the explanation. The add() was just an example; what I'm trying to do is actually a bit different. Do you have any suggestions on how to handle it? See below.

So what I'm trying to do is a two-step algorithm. The first step takes elements from the array two at a time and performs some function on them. The second step then takes elements from the array two at a time, offset by 1, and performs another function on them. In other words, the second step's operations overlap the first step's and depend on its results. It's kind of hard to explain; here's a visual guide, with a sequential sketch after it:

array = ABCDEFGHI…

1st step:

function(A, B)
function(C, D)
function(E, F)
function(G, H)
// the above calls are the same function and run in parallel

2nd step:

function(B, C)
function(D, E)
function(F, G)
function(H, I)
// the above calls are the same function and run in parallel
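In plain sequential code it would be something like this (f and g are hypothetical stand-ins for my two step functions, each updating its pair in place; a is the float array, n its length):

// step 1: (A,B), (C,D), ...
for (int i = 0; i + 1 < n; i += 2) f(&a[i], &a[i+1]);
// step 2: (B,C), (D,E), ...
for (int i = 1; i + 1 < n; i += 2) g(&a[i], &a[i+1]);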

Simple example:

float *d_array;
cudaMalloc((void**)&d_array, 1025*sizeof(float));
...
// Step 1
add2<<<2, 256>>>((float2 *)d_array, 1);
// Step 2
add2<<<2, 256>>>((float2 *)(d_array + 1), 1);

__global__ void add2(float2 *array, int amount)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    array[tid].x = array[tid].x + amount;
    array[tid].y = array[tid].x * amount;
}

See, in this example I've cast the array from float* to float2* when calling add2. This way I can load two float values from the array in one coalesced access. It works fine in step 1; the problem arises in step 2. By offsetting by 1 float, I've messed up the alignment like you said, and the kernel's accesses are no longer coalesced. Got any ideas how I can avoid this problem?
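(As an aside — if I understand the alignment rules right, the offset cast isn't just uncoalesced, it's formally misaligned, since float2 loads must be 8-byte aligned:

float2 *ok  = (float2 *)d_array;        // 8-byte aligned: a valid float2*
float2 *bad = (float2 *)(d_array + 1);  // only 4-byte aligned: float2 loads
                                        // through this pointer are misaligned)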

Could I avoid this by making a shifted copy of the array with cudaMemcpy? If so, is cudaMemcpyDeviceToDevice fast enough that I don't need to worry about it? Example below:

float *d_array, *d_array2;
cudaMalloc((void**)&d_array, 1025*sizeof(float));
cudaMalloc((void**)&d_array2, 1025*sizeof(float));
...
// Step 1
add2<<<2, 256>>>((float2 *)d_array, 1);
// Step 2
cudaMemcpy(d_array2, d_array+1, 1024*sizeof(float), cudaMemcpyDeviceToDevice);
add2<<<2, 256>>>((float2 *)d_array2, 1);

Any hints on that? Thanks for the reply btw.

The typical trick in that case is to simply load your data into shared memory, not into registers, so you can access it via indices. This is efficient and it also keeps your code clean and flexible.

__shared__ float buffer[256];

buffer[threadIdx.x] = array[tid];   // coalesced load into shared memory
__syncthreads();

float result;
if (threadIdx.x < 255) result = myfunction(buffer[threadIdx.x], buffer[threadIdx.x+1]);
else result = myfunction(buffer[threadIdx.x], array[tid+1]); // this one thread peeks into the next chunk of device memory

array[tid] = result; // write the results back to device (computing into a local first avoids racing with the neighbor thread's read of buffer)

That extra peek into the next segment is uncoalesced, but it’s one thread reading one value and therefore doesn’t have any (extra) performance penalty.

Though you could rewrite it like:

__shared__ float buffer[257];

buffer[threadIdx.x] = array[tid];
if (threadIdx.x == 0) buffer[256] = array[tid+256]; // thread 0 fetches the one overlap value from the next chunk
__syncthreads();

array[tid] = myfunction(buffer[threadIdx.x], buffer[threadIdx.x+1]); // write the results back to device
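For completeness, here is that second variant assembled into a full kernel — my arrangement, with float types to match the arrays above, a placeholder myfunction, and sized for the 1025-float / <<<4, 256>>> launches from earlier in the thread:

__device__ float myfunction(float a, float b) { return a + b; } // placeholder body

__global__ void step2(float *array)
{
    __shared__ float buffer[257];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    buffer[threadIdx.x] = array[tid];     // coalesced load
    if (threadIdx.x == 0)
        buffer[256] = array[tid + 256];   // single peek into the next chunk; assumes
                                          // the allocation extends one element past the
                                          // last thread, as the 1025-float examples do
    __syncthreads();

    array[tid] = myfunction(buffer[threadIdx.x], buffer[threadIdx.x + 1]); // coalesced store
}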