I would like to share an optimization issue, in my opinion connected with global memory access. I’m currently implementing row-column integral image algorithm (first step is summing all columns and then all rows). Pixels are placed row by row.
block.x = 512; block.y = 1;
grid.x = width / block.x + 1; grid.y = 1;
sumColumns<<<grid, block>>>(&out[widthIntegral + 1], width, height, widthIntegral);
block.x = 1; block.y = 512;
grid.y = height / block.y + 1; grid.x = 1;
sumRows<<<grid, block>>>(&out[widthIntegral + 1], width, height, widthIntegral);
and the functions:
__global__ void sumRows(float *out, int width, int height, int strideOut) {
int indexOut = blockDim.y * blockIdx.y + threadIdx.y;
if (indexOut < height) {
float sum = 0.0f;
int counter = 0;
indexOut *= strideOut;
for (; counter < width; counter++, indexOut++) {
sum += out[indexOut];
out[indexOut] = sum; // pierwsza komorka zerowa
}
}
}
__global__ void sumColumns(float *out, int width, int height, int strideOut) {
int indexOut = blockDim.x * blockIdx.x + threadIdx.x;
if (indexOut < width) {
float sum = 0.0f;
int counter = 0;
//endOut = indexOut + height * strideOut;
for (; counter < height; counter++, indexOut += strideOut) {
sum += out[indexOut];
out[indexOut] = sum;
}
}
}
The problem is the time of computing sumRows compared with sumColumns. The latter is much more effective, but the algorithm is symetrical. Swaping the order of execution does not affect the results.
Below is presented disproportion of sumRows and sumColumns (blue bar) executing time:
http://imgup.pl/di/G2UP/performance-issue-row-col.png
Much better performance is reached when executing two sumColumns connected with matrix transposition.
http://imgup.pl/di/HD8Y/performance-issue-col-col.png
Anyone have an idea why it happens?