Memory access performance

I would like to share an optimization issue that I believe is connected with global memory access. I am currently implementing a row-column integral image algorithm (the first step sums all columns, then all rows). Pixels are stored row by row.

block.x = 512; block.y = 1;
grid.x = width / block.x + 1; grid.y = 1;
sumColumns<<<grid, block>>>(&out[widthIntegral + 1], width, height, widthIntegral);

block.x = 1; block.y = 512;
grid.y = height / block.y + 1; grid.x = 1;
sumRows<<<grid, block>>>(&out[widthIntegral + 1], width, height, widthIntegral);

and the functions:

__global__ void sumRows(float *out, int width, int height, int strideOut) {
	int indexOut = blockDim.y * blockIdx.y + threadIdx.y; // one thread per row
	
	if (indexOut < height) {
		float sum = 0.0f;
		int counter = 0;
		indexOut *= strideOut;
		// running prefix sum along the row; each thread walks consecutive addresses
		for (; counter < width; counter++, indexOut++) {
			sum += out[indexOut];
			out[indexOut] = sum; // first cell is zero
		}
	}
}

__global__ void sumColumns(float *out, int width, int height, int strideOut) {
	int indexOut = blockDim.x * blockIdx.x + threadIdx.x; // one thread per column
	
	if (indexOut < width) {
		float sum = 0.0f;
		int counter = 0;
		// running prefix sum down the column; neighboring threads access neighboring addresses
		for (; counter < height; counter++, indexOut += strideOut) {
			sum += out[indexOut];
			out[indexOut] = sum;
		}
	}
}

The problem is the execution time of sumRows compared with sumColumns. The latter is much faster, even though the algorithm is symmetrical; swapping the order of execution does not affect the results.
The disproportion between the sumRows and sumColumns (blue bar) execution times is shown below:
http://imgup.pl/di/G2UP/performance-issue-row-col.png

Much better performance is reached when executing sumColumns twice, combined with a matrix transposition:
http://imgup.pl/di/HD8Y/performance-issue-col-col.png

Does anyone have an idea why this happens?

Because one of the two access patterns is inherently more efficient on a GPU, due to how GPU global memory is accessed.

Typically, for row-major storage, a fairly naive column-summing kernel tends to run fast because adjacent threads generate adjacent memory accesses; this is the concept of memory coalescing. Getting the same efficiency for row-summing is harder and generally results in more complex code, because an efficient row pass usually depends on some form of parallel scan or reduction.
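
For example, a coalesced row pass can be written with one block per row: the block stages the row in shared memory with coalesced loads, performs an in-block inclusive scan, and writes it back with coalesced stores. A minimal sketch (not your code; it assumes the row fits in a single block, i.e. width <= blockDim.x, and the kernel name is just a placeholder):

__global__ void sumRowsCoalesced(float *out, int width, int height, int strideOut) {
	extern __shared__ float row[];          // one row staged in shared memory
	int y = blockIdx.x;                     // one block per row
	int x = threadIdx.x;

	if (y >= height) return;

	// coalesced load: consecutive threads read consecutive addresses
	row[x] = (x < width) ? out[y * strideOut + x] : 0.0f;
	__syncthreads();

	// Hillis-Steele inclusive scan in shared memory
	for (int offset = 1; offset < blockDim.x; offset <<= 1) {
		float v = (x >= offset) ? row[x - offset] : 0.0f;
		__syncthreads();
		row[x] += v;
		__syncthreads();
	}

	// coalesced store
	if (x < width)
		out[y * strideOut + x] = row[x];
}

Launched as sumRowsCoalesced<<<height, blockSize, blockSize * sizeof(float)>>>(...) with blockSize >= width; rows wider than one block would need partial sums carried between tiles.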

There are a variety of strategies to address this. However, you might also want to know that the creation of integral images has already been addressed in a number of libraries, including NPP.
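
One such strategy is the transpose-based variant you already measured: sumColumns, transpose, sumColumns, transpose back. A standard tiled transpose through shared memory keeps both the reads and the writes coalesced. As a rough sketch (tile size and kernel name are just placeholders):

#define TILE 32

__global__ void transposeTiled(const float *in, float *out,
                               int width, int height, int strideIn, int strideOut) {
	__shared__ float tile[TILE][TILE + 1];  // +1 padding avoids shared-memory bank conflicts

	int x = blockIdx.x * TILE + threadIdx.x;
	int y = blockIdx.y * TILE + threadIdx.y;

	// coalesced read of a TILE x TILE block from the source
	if (x < width && y < height)
		tile[threadIdx.y][threadIdx.x] = in[y * strideIn + x];
	__syncthreads();

	// swap the block coordinates and write the transposed tile with coalesced stores
	x = blockIdx.y * TILE + threadIdx.x;
	y = blockIdx.x * TILE + threadIdx.y;
	if (x < height && y < width)
		out[y * strideOut + x] = tile[threadIdx.x][threadIdx.y];
}

Launched with blockDim = (TILE, TILE) and a grid covering the image in 32x32 tiles, a transpose like this usually runs at close to device-to-device copy speed, so two column passes plus two transposes can easily beat one strided row pass.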