Problem summing the values of all pixels in a column of a 32-bit (TIFF) picture

I’m new to this forum and I want to say thanks to all of you for your discussions and suggestions.
I’m trying to sum all the pixels in each column of a 32-bit (TIFF) grayscale picture (vertical integration, column by column).
This is my code:

extern "C" __global__ void invert(float* picturePixels, float* ColIntegration, int width, int height)
{
    int x = threadIdx.x+blockIdx.x*blockDim.x;
    int y = threadIdx.y+blockIdx.y*blockDim.y;
    if (x < width && y < height)
    {
        int index = y*width+x;
        ColIntegration[x] += picturePixels[index];
    }
}

I get the correct result only if the number of blocks is 1. When I increase the number of blocks to 2, 4, or 16, I get a wrong result (divided by 2, 4, or 16 respectively).
I think it is a “reduction problem”, but I do not understand where my mistake is.
Could someone please help me?
Thanks in advance.
Sincerely yours,
Pier Andrea.

It's a classic race condition. Threads with the same x but a different y run simultaneously, so more than one thread tries to increment the same element of ColIntegration at the same time, and some of the updates are lost. If you used an atomic float add, this problem would not occur.

You have multiple blocks writing to ColIntegration at the same time, and they are stepping on each other.

If you want a simple fix for the code, you could replace your last line with:

atomicAdd(ColIntegration + x, picturePixels[index]);

However this will not be particularly high performance.
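
For reference, here is a minimal sketch of what the atomic version might look like end to end. The kernel name sumColumnsAtomic and the host helper columnSumsAtomic are made up for illustration, the 16x16 block size is an arbitrary choice, and error checking is left out. It assumes a GPU that supports atomicAdd on float in global memory.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Same 2D indexing as the original kernel, but the accumulation is atomic,
// so threads that share a column no longer overwrite each other's updates.
extern "C" __global__ void sumColumnsAtomic(const float* picturePixels, float* ColIntegration,
                                            int width, int height)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (x < width && y < height) {
        atomicAdd(&ColIntegration[x], picturePixels[y * width + x]);
    }
}

// Hypothetical host helper (not from the thread): copies a row-major image to
// the GPU, zeroes the output, launches the kernel and returns the column sums.
// Error checking is omitted for brevity.
std::vector<float> columnSumsAtomic(const std::vector<float>& image, int width, int height)
{
    float* dImage = nullptr;
    float* dSums = nullptr;
    cudaMalloc(&dImage, image.size() * sizeof(float));
    cudaMalloc(&dSums, width * sizeof(float));
    cudaMemcpy(dImage, image.data(), image.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dSums, 0, width * sizeof(float));  // atomicAdd accumulates, so start from zero

    dim3 block(16, 16);  // arbitrary 2D block size
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    sumColumnsAtomic<<<grid, block>>>(dImage, dSums, width, height);

    std::vector<float> sums(width);
    cudaMemcpy(sums.data(), dSums, width * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dImage);
    cudaFree(dSums);
    return sums;
}
```

Note that the output buffer has to be zeroed before every launch, because the kernel only adds to whatever is already there.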

Instead, partition your code (and threadblocks) so that each threadblock is handling a whole column, or multiple columns. Then each thread block will be writing to its own portion of ColIntegration, and threadblocks will not be stepping on each other.
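
One way to read the partitioning idea above is one threadblock per column, with the threads of the block combining their partial sums in shared memory. This is only a sketch under assumptions: the names sumColumnsPerBlock, columnSumsPerBlock and THREADS_PER_BLOCK are invented here, the block size of 256 is arbitrary (it has to be a power of two for this reduction loop), and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int THREADS_PER_BLOCK = 256;  // assumed block size; must be a power of two

// One thread block per column: blockIdx.x selects the column, so each block
// writes exactly one element of ColIntegration and blocks never collide.
__global__ void sumColumnsPerBlock(const float* picturePixels, float* ColIntegration,
                                   int width, int height)
{
    __shared__ float partial[THREADS_PER_BLOCK];
    int x = blockIdx.x;

    // Each thread accumulates a strided subset of the rows of this column.
    float sum = 0.0f;
    for (int y = threadIdx.x; y < height; y += blockDim.x) {
        sum += picturePixels[y * width + x];
    }
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction of the per-thread partial sums in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        ColIntegration[x] = partial[0];
    }
}

// Hypothetical host helper: launches one block per column of a row-major image.
std::vector<float> columnSumsPerBlock(const std::vector<float>& image, int width, int height)
{
    float* dImage = nullptr;
    float* dSums = nullptr;
    cudaMalloc(&dImage, image.size() * sizeof(float));
    cudaMalloc(&dSums, width * sizeof(float));
    cudaMemcpy(dImage, image.data(), image.size() * sizeof(float), cudaMemcpyHostToDevice);

    sumColumnsPerBlock<<<width, THREADS_PER_BLOCK>>>(dImage, dSums, width, height);

    std::vector<float> sums(width);
    cudaMemcpy(sums.data(), dSums, width * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dImage);
    cudaFree(dSums);
    return sums;
}
```

Since each output element is written exactly once, the output does not need to be zeroed first. On the other hand, with a row-major image the threads of a block read addresses that are a full row apart, so the loads are not coalesced, and this layout is likely to pay off mainly for tall, narrow images.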

If you have a large number of columns in the image, you may be able to write a higher-performance kernel that assigns one thread per column, and each thread executes a for-loop, summing the results of the column. You then only need a 1D grid and threadblock structure.
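
A rough sketch of the one-thread-per-column approach, including the 1D launch. The names sumColumnsPerThread and columnSumsPerThread and the block size of 256 are assumptions made for this example, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per column: each thread walks down its own column and keeps the
// running sum in a register, so no two threads ever write the same output.
__global__ void sumColumnsPerThread(const float* picturePixels, float* ColIntegration,
                                    int width, int height)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    if (x < width) {
        float sum = 0.0f;
        for (int y = 0; y < height; y++) {
            sum += picturePixels[y * width + x];
        }
        ColIntegration[x] = sum;  // single write per column
    }
}

// Hypothetical host helper: 1D grid and 1D blocks covering all columns of a
// row-major image.
std::vector<float> columnSumsPerThread(const std::vector<float>& image, int width, int height)
{
    float* dImage = nullptr;
    float* dSums = nullptr;
    cudaMalloc(&dImage, image.size() * sizeof(float));
    cudaMalloc(&dSums, width * sizeof(float));
    cudaMemcpy(dImage, image.data(), image.size() * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;                               // arbitrary block size
    int blocks = (width + threads - 1) / threads;    // round up to cover every column
    sumColumnsPerThread<<<blocks, threads>>>(dImage, dSums, width, height);

    std::vector<float> sums(width);
    cudaMemcpy(sums.data(), dSums, width * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dImage);
    cudaFree(dSums);
    return sums;
}
```

With a row-major image, neighbouring threads read neighbouring pixels on every loop iteration, so the loads coalesce well. The catch is that the image needs enough columns to keep the GPU busy.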

Hi Fiepchen and txbob,
thank you very much for your suggestions. I’ve modified the code and now it works very well.
I’ve tried both solutions: the atomic addition and the for loop handled by one thread per column. I think the latter is more elegant and performs better.
This is my modified kernel, just in case someone needs to use it.

extern "C" __global__ void IntegrateColumns(float* picturePixels, float* ColIntegration, int width, int height)
{
    // One thread per column; ColIntegration must be zeroed before the launch.
    int index = threadIdx.x+blockIdx.x*blockDim.x;
    int y;
    if (index < width)
        for(y=0;y<height;y++) ColIntegration[index] += picturePixels[index+width*y];
}

Thank you again.
Pier Andrea.