Hey, so I was wondering if anyone had any ideas on how to coalesce memory access for a particular program I’m writing.
The problem is this: imagine a 2D grid of floats (currently I’m using a simple linear float* array to represent this). For every element on this 2D grid, I need to perform a calculation that’s based on that element and 3 of its “surrounding” neighbors.
Here’s a visual representation:
OOOOOOO
OOOAXOO
OOOXXOO
OOOOOOO
The ‘A’ is the element of interest. For that element, I need to read in the value of ‘A’ as well as the 3 ‘X’s around it and perform calculations based on those 4 values. I then take my result and place it into a new matrix at the same index as where A was.
This is being done for every element in the matrix/grid, not just for A, so it makes sense to use CUDA for parallel computation. The only problem is that the accesses to ‘A’ and the 3 ‘X’ values aren’t exactly coalesced, so the process is very slow. i.e., here’s what I’m currently doing:
float value1 = array[tid];              // A
float value2 = array[tid + 1];          // right X
float value3 = array[tid + width];      // below X
float value4 = array[tid + width + 1];  // below-right X
Any suggestions on how to get the memory accesses to coalesce? Thanks!
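For what it’s worth, one common pattern for this kind of stencil is to have each thread block cooperatively stage a tile of the input in shared memory with coalesced row reads, then let every thread pick its 4 values out of the tile. The sketch below assumes a 16×16 block, row-major storage, and uses a placeholder average in place of the real calculation (which wasn’t given); the names (stencil_kernel, TILE) are made up for illustration, and it skips the output for the last row/column, which don’t have a full neighborhood:

```cuda
#include <cuda_runtime.h>

#define TILE 16

__global__ void stencil_kernel(const float *in, float *out,
                               int width, int height)
{
    // (TILE+1) x (TILE+1): the extra row/column is the halo needed for
    // the +1 and +width neighbors of the tile's right/bottom edge threads.
    __shared__ float tile[TILE + 1][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;   // global column
    int y = blockIdx.y * TILE + threadIdx.y;   // global row

    // Cooperative load: each thread fetches one element; adjacent threads
    // in a row read consecutive addresses, so these loads coalesce.
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    // Edge threads also fetch the halo column/row/corner.
    if (threadIdx.x == TILE - 1 && x + 1 < width && y < height)
        tile[threadIdx.y][TILE] = in[y * width + x + 1];
    if (threadIdx.y == TILE - 1 && y + 1 < height && x < width)
        tile[TILE][threadIdx.x] = in[(y + 1) * width + x];
    if (threadIdx.x == TILE - 1 && threadIdx.y == TILE - 1 &&
        x + 1 < width && y + 1 < height)
        tile[TILE][TILE] = in[(y + 1) * width + x + 1];

    __syncthreads();

    if (x + 1 < width && y + 1 < height) {
        float a = tile[threadIdx.y][threadIdx.x];          // A
        float r = tile[threadIdx.y][threadIdx.x + 1];      // right X
        float b = tile[threadIdx.y + 1][threadIdx.x];      // below X
        float d = tile[threadIdx.y + 1][threadIdx.x + 1];  // below-right X
        out[y * width + x] = 0.25f * (a + r + b + d);      // placeholder op
    }
}
```

Each input value then gets read from global memory roughly once per block instead of up to 4 times, and the main loads are coalesced; the halo fetches are strided but there are only a handful per block.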