I am trying to compute the sum of every 11 x 11 submatrix of a large matrix. I do this with the code below (for simplicity I removed all code for index offsets and boundary conditions).
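__global__ void testadd(float *zz, int width, int size, float *zout)
{
    __shared__ float zhelp[64];   // one column sum per thread in the block
    // Blocks overlap by size-1 columns, so each block advances by blockDim.x-size+1.
    int x = blockIdx.x * (blockDim.x - size + 1) + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    float z = 0;
    int i;
    // Sum one column of height size, starting at row y.
    for (i = 0; i < size; i++)
        z += zz[(y + i) * width + x];
    zhelp[threadIdx.x] = z;
    __syncthreads();
    // Add the column sums of the next size-1 columns.
    for (i = 1; i < size; i++)
        z += zhelp[threadIdx.x + i];
    if (x < width)
        zout[y * width + x] = z;
}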
int size = 11;
dim3 dimBlock2(64,1);
int hulp=dimBlock2.x-size+1;
dim3 dimGrid2((width+hulp-1)/hulp,(height+dimBlock2.y-1)/dimBlock2.y);
testadd<<<dimGrid2, dimBlock2>>>(z, width, size, zout);
Each thread calculates the sum of one column (of height 11) and stores it in shared memory. Next, each thread fetches all the data it needs from shared memory to calculate the sum of one small matrix. In device emulation this all works fine, but on the actual device I have some problems. The problems occur at the first cells of the blocks in the x direction (this is the data that is also used by the previous block in the x direction; the first block in the x direction works correctly). It looks as if the synchronization does not take effect and the threads fetch no data from shared memory. However, this only happens for about 40% of all blocks. When I make my grid size 1 in the x direction and put a for loop around the kernel launch to cover all block positions in the x direction, I have no problems at all. How can I make my program work without the for loop?
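To pin down exactly which cells come out wrong on the device, a plain CPU reference may help. The following is only a minimal sketch, assuming the window for output (x, y) covers rows y..y+size-1 and columns x..x+size-1 as in the kernel, and that only windows lying fully inside the matrix matter. The names window_sums_ref, count_mismatches and tol are hypothetical helpers, not part of my original code.

```c
#include <stddef.h>

/* CPU reference: zout[y*width + x] = sum of the size x size window whose
   top-left corner is (x, y). Only windows that fit entirely inside the
   matrix are written; all other entries of zout are left untouched. */
void window_sums_ref(const float *zz, int width, int height, int size,
                     float *zout)
{
    for (int y = 0; y + size <= height; y++) {
        for (int x = 0; x + size <= width; x++) {
            float z = 0.0f;
            for (int j = 0; j < size; j++)
                for (int i = 0; i < size; i++)
                    z += zz[(size_t)(y + j) * width + (x + i)];
            zout[(size_t)y * width + x] = z;
        }
    }
}

/* Count the fully-inside windows where got differs from ref by more than tol. */
int count_mismatches(const float *ref, const float *got,
                     int width, int height, int size, float tol)
{
    int bad = 0;
    for (int y = 0; y + size <= height; y++) {
        for (int x = 0; x + size <= width; x++) {
            float d = ref[(size_t)y * width + x] - got[(size_t)y * width + x];
            if (d < 0.0f)
                d = -d;
            if (d > tol)
                bad++;
        }
    }
    return bad;
}
```

After copying zout back to the host, a call such as count_mismatches(ref, host_out, width, height, size, 1e-3f), where host_out is an assumed host copy of zout, would show how many cells differ. A small tolerance is needed because the device adds the values in a different order than the CPU loop.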