Hey, so I was wondering if anyone had any ideas on how to coalesce memory access for a particular program I’m writing.

The problem is this: imagine a 2D grid of floats (currently I'm using a simple flat linear array, float*, to represent it). For every element on this 2D grid, I need to perform a calculation that's based on that element and three of its "surrounding" neighbors.

Here’s a visual representation:

OOOOOOO
OOOAXOO
OOOXXOO
OOOOOOO

The 'A' is the element of interest. For that element, I need to read the value of 'A' as well as the three 'X's around it and perform a calculation on those four values. I then place the result into a new matrix at the same index as A.

This is being done for every element in the matrix/grid, not just for A, so it makes sense to use CUDA for parallel computation. The only problem is that the accesses to the 'A' and the three 'X' values aren't coalesced, so the kernel is very slow. Here's what I'm currently doing:

float value1 = array[tid];
float value2 = array[tid + 1];
float value3 = array[tid + width];
float value4 = array[tid + width + 1];

Any suggestions on how to get these memory accesses to coalesce? Thanks
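For what it's worth, the usual fix for this kind of 2x2 stencil is shared-memory tiling: each block cooperatively loads a tile of the grid (plus a one-element halo on the right and bottom) with coalesced row-contiguous reads, and each thread then fetches its neighbors from shared memory instead of global memory. A rough CUDA sketch, where the tile dimensions, names, and the averaging op are all placeholder assumptions rather than anything from the original code:

```
#define TILE_W 16
#define TILE_H 16

/* Stage a (TILE_H+1) x (TILE_W+1) tile in shared memory so the +1 and
 * +width neighbors are on-chip. Consecutive threadIdx.x values read
 * consecutive global addresses, so each row load coalesces. */
__global__ void stencil_tiled(const float *in, float *out,
                              int width, int height)
{
    __shared__ float tile[TILE_H + 1][TILE_W + 1];

    int gx = blockIdx.x * TILE_W + threadIdx.x;
    int gy = blockIdx.y * TILE_H + threadIdx.y;

    /* Main body of the tile. */
    if (gx < width && gy < height)
        tile[threadIdx.y][threadIdx.x] = in[gy * width + gx];
    /* Right halo column, loaded by the last thread column. */
    if (threadIdx.x == TILE_W - 1 && gx + 1 < width && gy < height)
        tile[threadIdx.y][threadIdx.x + 1] = in[gy * width + gx + 1];
    /* Bottom halo row, loaded by the last thread row. */
    if (threadIdx.y == TILE_H - 1 && gy + 1 < height && gx < width)
        tile[threadIdx.y + 1][threadIdx.x] = in[(gy + 1) * width + gx];
    /* Bottom-right halo corner. */
    if (threadIdx.x == TILE_W - 1 && threadIdx.y == TILE_H - 1 &&
        gx + 1 < width && gy + 1 < height)
        tile[threadIdx.y + 1][threadIdx.x + 1] = in[(gy + 1) * width + gx + 1];

    __syncthreads();

    /* Last row/column have no right/below neighbors, so skip them. */
    if (gx < width - 1 && gy < height - 1) {
        float a  = tile[threadIdx.y][threadIdx.x];
        float x1 = tile[threadIdx.y][threadIdx.x + 1];
        float x2 = tile[threadIdx.y + 1][threadIdx.x];
        float x3 = tile[threadIdx.y + 1][threadIdx.x + 1];
        out[gy * width + gx] = (a + x1 + x2 + x3) * 0.25f; /* placeholder op */
    }
}
```

Each element is then read from global memory roughly once per tile instead of up to four times, and the misaligned tid+1 / tid+width+1 reads happen in fast shared memory rather than global memory. Allocating the grid with cudaMallocPitch (so each row starts on an aligned boundary) can also help the row loads coalesce.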