coalescing help needed

I had a kernel that did something like this:

mykernel<<<2048/64, 64>>>(input, output);

__global__ void mykernel(input, output)
{
    // read in global data
    // do calculations on the global data
    // write some global data

    val1 = 0.0f;
    if (condition) {
        for (k = 0; k < eleven; k++) {
            for (l = 0; l < nine; l++) {
                val1 += tex2D(tex, k + base_k, l + base_l);
            }
        }
    }
    // write val1 to global memory
}
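(The tex2D calls assume a 2D texture bound somewhere; in the real code there is something like the following at file scope, with the element type being whatever the data actually is:)

// bound to the input with cudaBindTextureToArray()
texture<float, 2, cudaReadModeElementType> tex;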

So I thought it might be beneficial to split this kernel up and replace the double for loop with an 11x9 thread block. So I coded the following:

mykernel1<<<2048/64, 64>>>(input, output);
mykernel2<<<2048, dim3(11,9,1)>>>(input, output);

__global__ void mykernel1(input, output)
{
    // read in global data
    // do calculations on the global data
    // write some global data
    // write condition to global memory
}

__global__ void mykernel2(input, output)
{
    __shared__ float data;
    __shared__ int condition;
    __shared__ float val[11*9];

    unsigned int tid = threadIdx.x + __umul24(threadIdx.y, blockDim.x);
    val[tid] = 0.0f;

    if (tid == 0) {
        data = global_data[blockIdx.x];
        condition = global_condition[blockIdx.x];
    }
    __syncthreads();

    if (condition) {
        unsigned int k = threadIdx.x + base_k;
        unsigned int l = threadIdx.y + base_l;
        val[tid] = tex2D(tex, k, l);
    }

    // reduction of val

    if (tid == 0)
        // write val[0] to global memory
}
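For reference, the "reduction of val" is just the standard shared-memory tree reduction. A minimal sketch of what I mean, with val padded from 11*9 = 99 up to 128 entries so the power-of-two loop works (global_out is a placeholder for my output array):

__shared__ float val[128];               // 99 live entries, padded to 128

val[tid] = 0.0f;                         // every thread zeroes its own slot...
if (tid < 128 - 11*9)
    val[tid + 11*9] = 0.0f;              // ...and 29 threads zero the padding

// ... texture fetches into val[tid] as above ...
__syncthreads();

for (unsigned int s = 64; s > 0; s >>= 1) {
    if (tid < s)
        val[tid] += val[tid + s];        // fold the upper half onto the lower half
    __syncthreads();
}

if (tid == 0)
    global_out[blockIdx.x] = val[0];     // placeholder output array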

However, I found that the combined GPU time was higher than in the first case. Some overhead is expected: I now have to write some values to global memory and read them back in the second kernel, where they were just intermediate values in the first case. But I also saw that almost all my memory accesses are uncoalesced.

When reading the manual I saw that the most probable reason is that 15 out of 16 blocks read from a base address that is not aligned to a multiple of 16*sizeof(type) from the start of the array. Does anybody know of a solution other than allocating 16 times as much memory to satisfy that coalescing constraint? I could let the first 16 threads read in the data for 16 calculations and then have a

#pragma unroll 16
for (offset = 0; offset < 16; offset++) {
    // ...
}

letting each block calculate 16 values instead of 1 (sketched below), but I find that solution a bit inelegant (I was trying to get rid of for loops ;))
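To make that concrete, the batched version I have in mind would look roughly like this (only a sketch: the parameter types, global_out, and the loop body are made up / elided):

// one block now handles 16 consecutive work items, launched as
// mykernel2_batched<<<2048/16, dim3(11,9,1)>>>(...)
__global__ void mykernel2_batched(const float *global_data,
                                  const int   *global_condition,
                                  float       *global_out)
{
    __shared__ float data[16];
    __shared__ int   cond[16];

    unsigned int tid = threadIdx.x + __umul24(threadIdx.y, blockDim.x);

    // the first half-warp does one 16-float read whose base address
    // (global_data + blockIdx.x*16) should be 64-byte aligned, so it
    // should coalesce
    if (tid < 16) {
        data[tid] = global_data[blockIdx.x * 16 + tid];
        cond[tid] = global_condition[blockIdx.x * 16 + tid];
    }
    __syncthreads();

    #pragma unroll 16
    for (unsigned int offset = 0; offset < 16; offset++) {
        // ... per-offset texture fetches and the reduction from above,
        // using data[offset] and cond[offset] ...
    }
}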