I am looking for some suggestions on good programming patterns in CUDA for avoiding un-coalesced memory access.
The kernel I am trying to optimize is as follows:
__global__ void set_bnd_kernel_0(float *x)
{
    // IX(i, j) maps 2D grid coordinates into the 1D array (row-major, i + DIM*j)
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= DIM) return;  // guard against extra threads in the last block

    x[IX(0,     i)] = x[IX(1,     i)];
    x[IX(DIM-1, i)] = x[IX(DIM-2, i)];
    x[IX(i,     0)] = x[IX(i,     1)];
    x[IX(i, DIM-1)] = x[IX(i, DIM-2)];
}
Basically, I am setting each boundary cell to the value of the cell just inside it. Using the Compute Visual Profiler I have determined this is a major hot-spot in my program. I understand that each thread performs four global reads and four global writes, which are both slow and uncached, and that the column-boundary accesses are strided and therefore uncoalesced. One option I am pursuing is replacing the reads with tex1Dfetch calls so that at least they would be cached. However, then I cannot pass in a pointer, which makes my kernel significantly less reusable. I could add an if statement to select between textures, but then I would incur branching penalties.
Just wondering if the pros might be able to recommend what I could do in a situation like this?
Thanks!