Strange Performance Problem using atomicAdd

Hi,

I have a strange performance problem using atomicAdd. My kernel sums up the rows and columns of a matrix. To do this, the matrix is copied to shared memory and summed up using the following code:

//SMEM_INDEX(x, y) is just a modulo remap into the per-block tile, i.e. equivalent to
//((y) % THREADS_PER_BLOCK_Y) * THREADS_PER_BLOCK_Y + ((x) % THREADS_PER_BLOCK_X)
#define SMEM_INDEX(x, y)	(((y) - (((y) / THREADS_PER_BLOCK_Y) * THREADS_PER_BLOCK_Y)) * THREADS_PER_BLOCK_Y + \
							 (x) - (((x) / THREADS_PER_BLOCK_X) * THREADS_PER_BLOCK_X))

...

	//Add up rows
	if ((threadIdx.y < THREADS_PER_BLOCK_Y) && (threadIdx.x == 0))
	{
		unsigned int tmp = 0;
		#pragma unroll
		for (int i = 0; i < THREADS_PER_BLOCK_X; i++)
		{
			//Add up bins
			tmp += imgdata[SMEM_INDEX(i, threadIdx.y)];
		}
		atomicAdd(&d_vhisto[blockIdx.y * blockDim.y + threadIdx.y], tmp);
	}

	//Add up columns
	if ((threadIdx.x < THREADS_PER_BLOCK_X) && (threadIdx.y == 0))
	{
		unsigned int tmp = 0;
		#pragma unroll
		for (int i = 0; i < THREADS_PER_BLOCK_Y; i++)
		{
			//Add up bins
			tmp += imgdata[SMEM_INDEX(threadIdx.x, i)];
		}
		atomicAdd(&d_hhisto[blockIdx.x * blockDim.x + threadIdx.x], tmp);
	}

The code works fine, but when I comment out the first atomicAdd (the one used to add up the rows), my kernel runs in about half a millisecond. With that line left in, the runtime jumps to about 6 milliseconds. The second atomicAdd does not influence the performance significantly. Below are some profiling results. The symptom is obvious: the branch and divergent-branch counts go up sharply. But why does this single instruction cause that?

Counter            Without first atomicAdd    With first atomicAdd
GPU time (us)      446.848                    5770.69
CPU time (us)      541.433                    5871.77
grid X / Y         40 / 30                    40 / 30
block X / Y        16 / 16                    16 / 16
shared mem         1084                       1084
registers          6                          7
occupancy          1                          1
branch             9568                       28800
div. branch        598                        5400
instructions       142358                     286858
cta launched       1200                       1200
gld 32b            0                          0
gst 32b            0                          0
gst 64b            321993                     105985
tex cache hit      0                          0
tex cache miss     38400                      38400

Thanks in advance,

Kwyjibo

If you comment out the atomicAdd(), the whole preceding loop can be optimized out by the compiler.
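
For illustration, here is a minimal hypothetical kernel (the names are made up) showing the effect:

__global__ void sum_rows(const unsigned int *in, unsigned int *out, int n)
{
	unsigned int tmp = 0;
	for (int i = 0; i < n; i++)
		tmp += in[threadIdx.x * n + i];

	//If this store (or the atomicAdd in your kernel) is commented out, tmp is
	//never used, so the compiler deletes the whole loop as dead code and the
	//kernel does (almost) no work.
	out[threadIdx.x] = tmp;
}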

This is true for both loops. However, the first loop runs significantly slower than the second: with threadIdx.x == 0 as the condition, only two of the 32 threads in each warp do useful work, wasting 30/32 of the warp's resources. In the second loop (threadIdx.y == 0), only one warp remains active, with 16 of its threads used, while the warps where threadIdx.y > 0 do not execute the branch at all.
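
A small host-side sketch of the thread-to-warp mapping (assuming a 16x16 block with threadIdx.x varying fastest and 32-thread warps) makes this concrete:

#include <stdio.h>

int main(void)
{
	int rowSum[8] = {0};	//threads passing threadIdx.x == 0 (row-sum branch)
	int colSum[8] = {0};	//threads passing threadIdx.y == 0 (column-sum branch)

	for (int y = 0; y < 16; y++)
		for (int x = 0; x < 16; x++)
		{
			int warp = (y * 16 + x) / 32;	//linear thread id = y*16 + x
			if (x == 0) rowSum[warp]++;
			if (y == 0) colSum[warp]++;
		}

	for (int w = 0; w < 8; w++)
		printf("warp %d: row-sum threads %2d, column-sum threads %2d\n",
			w, rowSum[w], colSum[w]);
	//Prints 2 active threads in every warp for the row sum, but 16 active
	//threads in warp 0 (and none elsewhere) for the column sum.
	return 0;
}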

Thank you, I moved both for loops into the second if block, and now it runs like a charm in about 0.7 ms.

EDIT: I fine-tuned my code; it now looks like this, and I still get different runtimes for htype = VERTICAL and htype = HORIZONTAL. The difference is not dramatic (about 0.4 ms), and I can live with it, but I would still like to know what is causing it. (The reason is not the calculation of i_shifted.)

	if ((threadIdx.x < BLOCK_SIDE_LENGTH) && (threadIdx.y == 0))
	{
		unsigned int tmp = 0;
		unsigned int bin;

		if (htype == HORIZONTAL || htype == BOTH)
		{
			//Add up columns
			#pragma unroll
			for (int i = 0; i < BLOCK_SIDE_LENGTH; i++)
			{
				//Add up bins
				tmp += imgdata[SMEM_INDEX(threadIdx.x, i)];
			}
			bin = blockIdx.x * blockDim.x + threadIdx.x;
			if (bin < (desc.d_width / desc.d_padding)) atomicAdd(&d_hhisto[bin], tmp);
		}

		tmp = 0;
		if (htype == VERTICAL || htype == BOTH)
		{
			//Add up rows
			#pragma unroll
			for (int i = 0; i < BLOCK_SIDE_LENGTH; i++)
			{
				//Shift i to avoid bank conflicts
				int i_shifted = (i + threadIdx.x) % BLOCK_SIDE_LENGTH;
				//Add up bins
				tmp += imgdata[SMEM_INDEX(i_shifted, threadIdx.x)];
			}
			bin = blockIdx.y * blockDim.y + threadIdx.x;
			if (bin < desc.d_height) atomicAdd(&d_vhisto[bin], tmp);
		}
	}
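
For reference, a minimal host-side sketch (assuming 16 four-byte shared-memory banks, as on compute capability 1.x hardware) of what the shift does to the bank mapping in the vertical loop:

#include <stdio.h>

#define BLOCK_SIDE_LENGTH 16
#define NUM_BANKS 16	//assumption: 16 banks of 32-bit words

int main(void)
{
	int i = 3;	//any fixed iteration; the pattern is the same for every i
	for (int tx = 0; tx < BLOCK_SIDE_LENGTH; tx++)
	{
		//Row-major layout as in SMEM_INDEX: addr = row * 16 + column
		int addr_plain = tx * BLOCK_SIDE_LENGTH + i;	//without the shift
		int i_shifted = (i + tx) % BLOCK_SIDE_LENGTH;
		int addr_shifted = tx * BLOCK_SIDE_LENGTH + i_shifted;	//with the shift
		printf("thread %2d: bank %2d (no shift), bank %2d (shifted)\n",
			tx, addr_plain % NUM_BANKS, addr_shifted % NUM_BANKS);
	}
	//Without the shift all 16 active threads hit the same bank (16-way conflict);
	//with the shift each thread hits a different bank.
	return 0;
}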

Any idea?

Thanks in advance,

Kwyjibo
