Not all indexes hit when 2D indexing with blockIdx and threadIdx

Hi there,
I am trying to implement a rather simple averaging step during the transformation of an image. I have already implemented the transformation successfully, but now I have to process the resulting image by summing up all pixels in each 5x5-pixel rectangle. My idea was to increment a counter for each such 5x5 block whenever a pixel in this block is set. However, these block counters are incremented far less often than they should be. So for debugging I checked how often any pixel of such a block is hit at all:

    int x = (blockIdx.x*blockDim.x) + threadIdx.x;
    int y = (blockIdx.y*blockDim.y) + threadIdx.y;

    resultArray [0]++; 

The kernel is called like this:

    dim3 threadsPerBlock(8, 8);
    dim3 grid(targetAreaRect_px._uiWidth / threadsPerBlock.x, targetAreaRect_px._uiHeight / threadsPerBlock.y);
    CudaTransformAndAverageImage<<<grid, threadsPerBlock>>>(pcPreRasteredImage_dyn, resultArray);

I would expect resultArray[0] to contain 25 after kernel execution, but it only contains 1. Is this due to some optimization by the CUDA compiler?
Any help is welcome! Thank you in advance,

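Use global atomics (e.g. atomicAdd) instead of the ++ operator for incrementing your resultArray. The ++ is a plain read-modify-write, so when many threads execute it on the same element at the same time, they read the same old value and most of the increments are lost. This is a data race, not a compiler optimization.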
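
To make this concrete, here is a minimal sketch of the per-5x5-block counting with atomicAdd. It is not the original kernel: the names countSetPixelsPerCell and countFirstCell, the all-set test image and the unsigned int counter type are assumptions made for illustration. It also rounds the grid size up and bounds-checks inside the kernel, which the code in the question does not do. Error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: one thread per pixel. Every set pixel increments the
// counter of the 5x5 cell it falls into.
__global__ void countSetPixelsPerCell(const unsigned char* image,
                                      unsigned int width, unsigned int height,
                                      unsigned int cellsPerRow,
                                      unsigned int* cellCounters)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;  // the grid may overhang the image

    if (image[y * width + x] != 0) {
        unsigned int cell = (y / 5) * cellsPerRow + (x / 5);
        // Atomic read-modify-write: concurrent increments are not lost.
        atomicAdd(&cellCounters[cell], 1u);
    }
}

// Hypothetical host helper: builds a test image with every pixel set, runs the
// kernel, and returns the counter of the top-left 5x5 cell.
unsigned int countFirstCell(unsigned int width, unsigned int height)
{
    unsigned int cellsPerRow = (width + 4) / 5;   // round up
    unsigned int cellsPerCol = (height + 4) / 5;
    size_t numCells  = static_cast<size_t>(cellsPerRow) * cellsPerCol;
    size_t numPixels = static_cast<size_t>(width) * height;

    unsigned char* dImage = nullptr;
    unsigned int* dCounters = nullptr;
    cudaMalloc(&dImage, numPixels);
    cudaMalloc(&dCounters, numCells * sizeof(unsigned int));
    cudaMemset(dImage, 1, numPixels);                          // all pixels set
    cudaMemset(dCounters, 0, numCells * sizeof(unsigned int)); // counters start at 0

    dim3 threadsPerBlock(8, 8);
    dim3 grid((width  + threadsPerBlock.x - 1) / threadsPerBlock.x,
              (height + threadsPerBlock.y - 1) / threadsPerBlock.y);
    countSetPixelsPerCell<<<grid, threadsPerBlock>>>(dImage, width, height,
                                                     cellsPerRow, dCounters);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    std::vector<unsigned int> counters(numCells);
    cudaMemcpy(counters.data(), dCounters, numCells * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);

    cudaFree(dImage);
    cudaFree(dCounters);
    return counters[0];
}
```

With a 16x16 image in which every pixel is set, countFirstCell(16, 16) should return 25, since the top-left cell covers x and y from 0 to 4. Replacing the atomicAdd with a plain ++ would typically give a much smaller number.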

cross posted: