I have a strange performance problem with atomicAdd. My kernel sums up the rows and columns of a matrix. To do this, the matrix is copied to shared memory and summed up using the following code:
#define SMEM_INDEX(x, y) ((y) - ((y/THREADS_PER_BLOCK_Y) * THREADS_PER_BLOCK_Y)) * THREADS_PER_BLOCK_Y + \
                         (x) - ((x/THREADS_PER_BLOCK_X) * THREADS_PER_BLOCK_X)
...
//Add up rows
if ((threadIdx.y < THREADS_PER_BLOCK_Y) && (threadIdx.x == 0))
{
    unsigned int tmp = 0;
    #pragma unroll
    for (int i = 0; i < THREADS_PER_BLOCK_X; i++)
    {
        //Add up bins
        tmp += imgdata[SMEM_INDEX(i, threadIdx.y)];
    }
    atomicAdd(&d_vhisto[blockIdx.y * blockDim.y + threadIdx.y], tmp);
}
//Add up columns
if ((threadIdx.x < THREADS_PER_BLOCK_X) && (threadIdx.y == 0))
{
    unsigned int tmp = 0;
    #pragma unroll
    for (int i = 0; i < THREADS_PER_BLOCK_Y; i++)
    {
        //Add up bins
        tmp += imgdata[SMEM_INDEX(threadIdx.x, i)];
    }
    atomicAdd(&d_hhisto[blockIdx.x * blockDim.x + threadIdx.x], tmp);
}
The code works fine, but when I comment out the first atomicAdd (the one used to add up the rows), the kernel runs in about half a millisecond. With that line left in, the runtime jumps to about 6 milliseconds. The second atomicAdd does not influence the performance significantly. Below are some profiling results. The problem is obvious: the kernel branches. But why does this single instruction cause that?
If you comment out the atomicAdd(), the whole preceding loop can be optimized out by the compiler.
This is true for both loops. However, the first loop runs significantly slower than the second, because it uses only two threads of each warp, wasting 30/32 of the warp's lanes. In the second loop, only one warp remains active (with 16 of its threads doing work), while the warps where threadIdx.y > 0 do not execute at all.
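To make the warp arithmetic concrete, here is a small host-side sketch (plain C++, no CUDA needed) that reproduces the count. It assumes a 16x16 block, which the 16-thread and 30/32 figures above imply, and uses the fact that warps are built from the linearized thread index threadIdx.y * blockDim.x + threadIdx.x:

// Hedged sketch: assumes blockDim = (16, 16); warp id = linear thread id / 32.
#include <cstdio>
#include <map>

int main()
{
    const int bx = 16, by = 16;            // assumed THREADS_PER_BLOCK_X/Y
    std::map<int, int> rowLanes, colLanes; // warp id -> number of active lanes

    for (int y = 0; y < by; y++)
        for (int x = 0; x < bx; x++)
        {
            int warp = (y * bx + x) / 32;  // warp of this (x, y) thread
            if (x == 0) rowLanes[warp]++;  // guard of the "add up rows" branch
            if (y == 0) colLanes[warp]++;  // guard of the "add up columns" branch
        }

    // rows -> 8 warps with 2 lanes each, columns -> 1 warp with 16 lanes
    printf("rows:    %zu active warps, %d lanes in the first\n",
           rowLanes.size(), rowLanes.begin()->second);
    printf("columns: %zu active warps, %d lanes in the first\n",
           colLanes.size(), colLanes.begin()->second);
    return 0;
}

So the row branch keeps eight warps resident just to feed two lanes each, while the column branch occupies a single warp.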
Thank you, I moved both for loops into the second if block, and now it runs like a charm in about 700 µs.
EDIT: I fine-tuned my code; it now looks like the listing below, and I still get different runtimes for htype = VERTICAL and htype = HORIZONTAL. The difference is not dramatic (about 0.4 ms) and I can live with it, but I would still like to know what causes it. (The calculation of i_shifted is not the reason.)
if ((threadIdx.x < BLOCK_SIDE_LENGTH) && (threadIdx.y == 0))
{
    unsigned int tmp = 0;
    unsigned int bin;
    if (htype == HORIZONTAL || htype == BOTH)
    {
        //Add up columns
        #pragma unroll
        for (int i = 0; i < BLOCK_SIDE_LENGTH; i++)
        {
            //Add up bins
            tmp += imgdata[SMEM_INDEX(threadIdx.x, i)];
        }
        bin = blockIdx.x * blockDim.x + threadIdx.x;
        if (bin < (desc.d_width/desc.d_padding)) atomicAdd(&d_hhisto[bin], tmp);
    }
    tmp = 0;
    if (htype == VERTICAL || htype == BOTH)
    {
        //Add up rows
        #pragma unroll
        for (int i = 0; i < BLOCK_SIDE_LENGTH; i++)
        {
            //Shift i to avoid bank conflicts
            int i_shifted = (i + threadIdx.x) % BLOCK_SIDE_LENGTH;
            //Add up bins
            tmp += imgdata[SMEM_INDEX(i_shifted, threadIdx.x)];
        }
        bin = blockIdx.y * blockDim.y + threadIdx.x;
        if (bin < desc.d_height) atomicAdd(&d_vhisto[bin], tmp);
    }
}
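If you want to narrow down where the remaining ~0.4 ms goes, one option is to time each htype separately with CUDA events. The sketch below is only illustrative: the kernel name sumKernel, its argument list, and grid/block are placeholders for whatever your actual launch looks like; only the cudaEvent calls are the point.

// Hedged timing sketch: sumKernel, its arguments, grid and block are placeholders.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

int htypes[] = { HORIZONTAL, VERTICAL };
for (int k = 0; k < 2; k++)
{
    cudaEventRecord(start);
    sumKernel<<<grid, block>>>(d_img, d_hhisto, d_vhisto, desc, htypes[k]);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("htype = %d: %.3f ms\n", htypes[k], ms);
}

cudaEventDestroy(start);
cudaEventDestroy(stop);

Timing the two paths in isolation at least tells you whether the difference comes from one branch's loop or from its atomics, rather than from launch overhead.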