Can't Get Shared Memory To Work

I have a code that I'm trying to optimize to use shared memory, but it's been a headache, to say the least. The example code below cannot be compiled on its own because it depends on other data structures that are initialized elsewhere in a much larger code.

[codebox]#define SHRDX(i,j,k) ( shrdMem[(i)*16*5 + (j)*5 + (k)] )

__global__ void __launch_bounds__(MAX_THREADS_PER_BLOCK,MIN_BLOCKS_PER_SM) flux_x( fpprec *state , fpprec *flux_x , int numx , int startInd ) {
    int i, j, k, ii, jj, kk, l;

    extern __shared__ fpprec shrdMem[];

    //Global thread indices, then cell indices offset by the ghost cells
    ii = blockIdx.x*blockDim.x+threadIdx.x;
    jj = blockIdx.y*blockDim.y+threadIdx.y;
    kk = blockIdx.z*blockDim.z+threadIdx.z;
    i = ii + GS;
    j = jj + GS;
    if (i >= numx+GS || j >= NYC+GS) return;

    for (k = kk+GS; k < NZC+GS; k+=gridDim.z*blockDim.z) {
        //Store horizontal stencil in shared memory
        for (l = 0; l < 5; l++) SHRDX(threadIdx.x+GS,threadIdx.y,l) = STATE(i,j,k,l); //Load the middle domain

        //Sanity check: shared memory should hold the value that was just written
        if (SHRDX(threadIdx.x+GS,threadIdx.y,0) != STATE(i,j,k,0)) printf("Problem: i:%d, j:%d, ii:%d, jj:%d\n", i, j, threadIdx.x, threadIdx.y);
[/codebox]

By the way, fpprec is a typedef that switches between single and double precision. Sometimes this runs fine for a few time steps, and then suddenly the values in shared memory no longer match the values read back from STATE. This happens at seemingly random points. I've checked a hundred times to make sure I'm not overflowing the allocated amount of shared memory. Can any of you spot something that's wrong? Thanks so much.
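
For reference, here is a minimal host-side sketch of the kind of overflow check described above. It is not the code from the larger project. The macro `USE_DOUBLE`, the constants `BLOCK_X`, `BLOCK_Y` and `NVARS`, the value of `GS`, and the helper names are all hypothetical, and it assumes the i index of the shared array spans blockDim.x plus GS ghost cells on each side. The one thing it relies on is that the i stride must be at least blockDim.y*5, otherwise distinct (i,j,k) triples map to the same slot.

```cpp
#include <cassert>
#include <cstddef>

// Precision switch. USE_DOUBLE is a hypothetical macro name.
#ifdef USE_DOUBLE
typedef double fpprec;
#else
typedef float fpprec;
#endif

// Hypothetical values. The real ones are defined elsewhere in the larger code.
const int GS      = 2;   // ghost-cell width (assumed)
const int BLOCK_X = 8;   // blockDim.x (assumed)
const int BLOCK_Y = 16;  // blockDim.y (assumed)
const int NVARS   = 5;   // state variables per cell (l = 0..4 in the kernel)

// Elements in one i-slab: one slot per (threadIdx.y, variable) pair.
const int STRIDE_I = BLOCK_Y * NVARS;

// Flattened offset, using the same layout as a SHRDX-style macro.
inline std::size_t shrdIndex(int i, int j, int k) {
    return (std::size_t)i * STRIDE_I + (std::size_t)j * NVARS + (std::size_t)k;
}

// Number of elements needed when i runs over blockDim.x + 2*GS cells.
inline std::size_t shrdElems() {
    return (std::size_t)(BLOCK_X + 2 * GS) * STRIDE_I;
}

// Bytes to request as dynamic shared memory at kernel launch.
inline std::size_t shrdBytes() {
    return shrdElems() * sizeof(fpprec);
}

// Largest offset any thread in the block can touch with this layout.
inline std::size_t shrdMaxIndex() {
    return shrdIndex(BLOCK_X + 2 * GS - 1, BLOCK_Y - 1, NVARS - 1);
}

// Aborts if the layout could index past the end of the allocation.
inline void checkShrdLayout() {
    assert(shrdMaxIndex() < shrdElems());
}
```

Under these assumptions, `shrdBytes()` is the value that would go in the third launch parameter, e.g. `flux_x<<<grid, block, shrdBytes()>>>(...)`, and `checkShrdLayout()` would be called once on the host before the launch.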