I had a kernel that did something like this:
mykernel<<<2048/64, 64>>>(input, output);
__global__ void mykernel(input, output)
    read in global data
    do calculations on the global data
    write some global data
    val1 = 0.0f;
    if (condition) {
        for (k = 0; k < 11; k++) {
            for (l = 0; l < 9; l++) {
                val1 += tex2D(tex, k + base_k, l + base_l);
            }
        }
    }
    write val1 to global memory
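Spelled out a bit more, the relevant part looks roughly like this (simplified sketch, not my real code; tex, out and the condition/base values are placeholders for what the kernel actually computes):

texture<float, 2, cudaReadModeElementType> tex;    // bound to the 2D data

__global__ void mykernel(const float *input, float *out)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per output

    float data = input[i];                 // "read in global data"
    int   condition = (data > 0.0f);       // placeholder for the real condition
    float base_k = data, base_l = data;    // placeholders for the real base coordinates

    float val1 = 0.0f;
    if (condition) {
        for (int k = 0; k < 11; k++)       // the 11x9 texture window
            for (int l = 0; l < 9; l++)
                val1 += tex2D(tex, k + base_k, l + base_l);
    }
    out[i] = val1;                         // "write val1 to global memory"
}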
So I thought it might be beneficial to split this kernel up and replace the double for loop with the threads of a block, and coded the following:
mykernel1<<<2048/64, 64>>>(input, output);
mykernel2<<<2048,dim3(11,9,1)>>>(input, output);
__global__ void mykernel1(input, output)
    read in global data
    do calculations on the global data
    write some global data
    write condition to global memory
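The only real difference to the first part of the original kernel is that the intermediate value and the condition now go out to global memory, one entry per future block of mykernel2 (again a sketch with made-up array names):

__global__ void mykernel1(const float *input, float *output,
                          float *global_data, int *global_condition)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;   // thread i feeds block i of mykernel2

    float data = input[i];
    // ... calculations on data ...
    output[i] = data;                       // "write some global data"
    global_data[i] = data;                  // intermediate value mykernel2 will read
    global_condition[i] = (data > 0.0f);    // placeholder condition, one flag per block
}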
__global__ void mykernel2(input, output)
    __shared__ float data;
    __shared__ int condition;
    __shared__ float val[11*9];
    unsigned int tid = threadIdx.x + __umul24(threadIdx.y, blockDim.x);
    val[tid] = 0.0f;
    if (tid == 0) {
        // one thread per block fetches the intermediate value and the condition
        data = global_data[blockIdx.x];
        condition = global_condition[blockIdx.x];
    }
    __syncthreads();
    if (condition) {
        // the former 11x9 double loop: one texture fetch per thread
        unsigned int k = threadIdx.x + base_k;
        unsigned int l = threadIdx.y + base_l;
        val[tid] = tex2D(tex, k, l);
    }
    __syncthreads();
    reduction of val
    if (tid == 0)
        write val[0] to global memory
}
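The "reduction of val" is the usual shared-memory tree reduction; since a block has 11*9 = 99 threads (not a power of two) the shared array gets padded to 128 entries. Written out, it would be roughly this (again a sketch, with tex and the global arrays as made-up names):

texture<float, 2, cudaReadModeElementType> tex;

__global__ void mykernel2(const float *global_data, const int *global_condition,
                          const int *base_k, const int *base_l, float *out)
{
    __shared__ float data;
    __shared__ int condition;
    __shared__ float val[128];              // 99 partial values, padded to 128 with zeros

    unsigned int tid = threadIdx.x + __umul24(threadIdx.y, blockDim.x);

    val[tid] = 0.0f;
    if (tid + 99 < 128)
        val[tid + 99] = 0.0f;               // threads 0..28 also clear the padding slots
    if (tid == 0) {
        data = global_data[blockIdx.x];     // intermediate value from mykernel1 (used in the real kernel)
        condition = global_condition[blockIdx.x];
    }
    __syncthreads();

    if (condition)
        val[tid] = tex2D(tex, threadIdx.x + base_k[blockIdx.x],
                              threadIdx.y + base_l[blockIdx.x]);
    __syncthreads();

    // tree reduction over the 128 entries
    for (unsigned int s = 64; s > 0; s >>= 1) {
        if (tid < s)
            val[tid] += val[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = val[0];           // one result per block
}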
However, I found out that the combined GPU time was higher than in the first case. Part of that is expected: values that were just intermediates in the first version now have to be written to global memory and read back in the second kernel. But I also saw that almost all my memory accesses are uncoalesced.
When reading the manual I saw that the most probable reason is that in 15 out of 16 blocks the read starts at an address whose offset from the start of the array is not a multiple of 16*sizeof(type), so it cannot be coalesced. Does anybody know of a solution other than allocating 16 times as much memory to satisfy that coalescing constraint? I could let the first 16 threads read in the data for 16 calculations and then have a
#pragma unroll 16
for (offset = 0; offset < 16; offset++) {
    // ... handle value number 'offset' ...
}
and let each block calculate 16 values instead of 1, but I find that solution a bit inelegant (I was trying to get rid of for loops ;))
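For what it's worth, this is roughly what that workaround would look like (sketch only; the staging array and the per-value work are placeholders):

__global__ void mykernel2_batched(const float *global_data, float *out)
{
    __shared__ float staged[16];

    unsigned int tid = threadIdx.x + __umul24(threadIdx.y, blockDim.x);

    // one coalesced half-warp read: global_data + 16*blockIdx.x starts at a
    // multiple of 16*sizeof(float), so the access satisfies the constraint
    if (tid < 16)
        staged[tid] = global_data[blockIdx.x * 16 + tid];
    __syncthreads();

    float result = 0.0f;
    #pragma unroll 16
    for (int offset = 0; offset < 16; offset++) {
        // ... here each iteration would do what one block of mykernel2
        //     currently does for a single value ...
        result += staged[offset];           // placeholder for the real work
    }

    if (tid == 0)
        out[blockIdx.x] = result;           // placeholder output, one per block
}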