Strange Performance Problem using atomicAdd

Hi,

I have a strange performance problem using atomicAdd. My kernel sums up the rows and columns of a matrix. To do this, the matrix is copied to shared memory and summed up using the following code:

//SMEM_INDEX(x, y) is just a modulo remap into the per-block tile, i.e. equivalent to
//((y) % THREADS_PER_BLOCK_Y) * THREADS_PER_BLOCK_Y + ((x) % THREADS_PER_BLOCK_X)
#define SMEM_INDEX(x, y)	(((y) - (((y) / THREADS_PER_BLOCK_Y) * THREADS_PER_BLOCK_Y)) * THREADS_PER_BLOCK_Y + \
							 (x) - (((x) / THREADS_PER_BLOCK_X) * THREADS_PER_BLOCK_X))

...

	//Add up rows
	if ((threadIdx.y < THREADS_PER_BLOCK_Y) && (threadIdx.x == 0))
	{
		unsigned int tmp = 0;
		#pragma unroll
		for (int i = 0; i < THREADS_PER_BLOCK_X; i++)
		{
			//Add up bins
			tmp += imgdata[SMEM_INDEX(i, threadIdx.y)];
		}
		atomicAdd(&d_vhisto[blockIdx.y * blockDim.y + threadIdx.y], tmp);
	}

	//Add up columns
	if ((threadIdx.x < THREADS_PER_BLOCK_X) && (threadIdx.y == 0))
	{
		unsigned int tmp = 0;
		#pragma unroll
		for (int i = 0; i < THREADS_PER_BLOCK_Y; i++)
		{
			//Add up bins
			tmp += imgdata[SMEM_INDEX(threadIdx.x, i)];
		}
		atomicAdd(&d_hhisto[blockIdx.x * blockDim.x + threadIdx.x], tmp);
	}

The code works fine, but when I comment out the first atomicAdd (the one used to add up the rows), my kernel runs in about half a millisecond. With that line left in, the runtime jumps to about 6 milliseconds. The second atomicAdd does not influence the performance significantly. Below are some profiling results. The symptom is obvious: the branch and divergent-branch counts go up sharply. But why does this single instruction cause that?

Counter            Without first atomicAdd    With first atomicAdd
GPU time (us)      446.848                    5770.69
CPU time (us)      541.433                    5871.77
grid X / Y         40 / 30                    40 / 30
block X / Y        16 / 16                    16 / 16
shared mem         1084                       1084
registers          6                          7
occupancy          1                          1
branch             9568                       28800
div. branch        598                        5400
instructions       142358                     286858
cta launched       1200                       1200
gld 32b            0                          0
gst 32b            0                          0
gst 64b            321993                     105985
tex cache hit      0                          0
tex cache miss     38400                      38400

Thanks in advance,

Kwyjibo

If you comment out the atomicAdd(), the whole preceding loop can be optimized out by the compiler.
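
For illustration, here is a minimal hypothetical kernel (the names are made up) showing the effect:

__global__ void sum_rows(const unsigned int *in, unsigned int *out, int n)
{
	unsigned int tmp = 0;
	for (int i = 0; i < n; i++)
		tmp += in[threadIdx.x * n + i];

	//If this store (or the atomicAdd in your kernel) is commented out, tmp is
	//never used, so the compiler deletes the whole loop as dead code and the
	//kernel does (almost) no work.
	out[threadIdx.x] = tmp;
}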

This is true for both loops. However, the first loop runs significantly slower than the second: with threadIdx.x == 0 as the condition, only two of the 32 threads in each warp do useful work, wasting 30/32 of the warp's resources. In the second loop (threadIdx.y == 0), only one warp remains active, with 16 of its threads used, while the warps where threadIdx.y > 0 do not execute the branch at all.
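
A small host-side sketch of the thread-to-warp mapping (assuming a 16x16 block with threadIdx.x varying fastest and 32-thread warps) makes this concrete:

#include <stdio.h>

int main(void)
{
	int rowSum[8] = {0};	//threads passing threadIdx.x == 0 (row-sum branch)
	int colSum[8] = {0};	//threads passing threadIdx.y == 0 (column-sum branch)

	for (int y = 0; y < 16; y++)
		for (int x = 0; x < 16; x++)
		{
			int warp = (y * 16 + x) / 32;	//linear thread id = y*16 + x
			if (x == 0) rowSum[warp]++;
			if (y == 0) colSum[warp]++;
		}

	for (int w = 0; w < 8; w++)
		printf("warp %d: row-sum threads %2d, column-sum threads %2d\n",
			w, rowSum[w], colSum[w]);
	//Prints 2 active threads in every warp for the row sum, but 16 active
	//threads in warp 0 (and none elsewhere) for the column sum.
	return 0;
}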

Thank you, I moved both for loops into the second if block, and now it runs like a charm in about 0.7 ms.

EDIT: I fine-tuned my code; it now looks like this, and I still get different runtimes for htype = VERTICAL and htype = HORIZONTAL. The difference is not dramatic (about 0.4 ms), and I can live with it, but I would still like to know what is causing it. (The reason is not the calculation of i_shifted.)

	if ((threadIdx.x < BLOCK_SIDE_LENGTH) && (threadIdx.y == 0))
	{
		unsigned int tmp = 0;
		unsigned int bin;

		if (htype == HORIZONTAL || htype == BOTH)
		{
			//Add up columns
			#pragma unroll
			for (int i = 0; i < BLOCK_SIDE_LENGTH; i++)
			{
				//Add up bins
				tmp += imgdata[SMEM_INDEX(threadIdx.x, i)];
			}
			bin = blockIdx.x * blockDim.x + threadIdx.x;
			if (bin < (desc.d_width / desc.d_padding)) atomicAdd(&d_hhisto[bin], tmp);
		}

		tmp = 0;
		if (htype == VERTICAL || htype == BOTH)
		{
			//Add up rows
			#pragma unroll
			for (int i = 0; i < BLOCK_SIDE_LENGTH; i++)
			{
				//Shift i to avoid bank conflicts
				int i_shifted = (i + threadIdx.x) % BLOCK_SIDE_LENGTH;
				//Add up bins
				tmp += imgdata[SMEM_INDEX(i_shifted, threadIdx.x)];
			}
			bin = blockIdx.y * blockDim.y + threadIdx.x;
			if (bin < desc.d_height) atomicAdd(&d_vhisto[bin], tmp);
		}
	}
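
For reference, a minimal host-side sketch (assuming 16 four-byte shared-memory banks, as on compute capability 1.x hardware) of what the shift does to the bank mapping in the vertical loop:

#include <stdio.h>

#define BLOCK_SIDE_LENGTH 16
#define NUM_BANKS 16	//assumption: 16 banks of 32-bit words

int main(void)
{
	int i = 3;	//any fixed iteration; the pattern is the same for every i
	for (int tx = 0; tx < BLOCK_SIDE_LENGTH; tx++)
	{
		//Row-major layout as in SMEM_INDEX: addr = row * 16 + column
		int addr_plain = tx * BLOCK_SIDE_LENGTH + i;	//without the shift
		int i_shifted = (i + tx) % BLOCK_SIDE_LENGTH;
		int addr_shifted = tx * BLOCK_SIDE_LENGTH + i_shifted;	//with the shift
		printf("thread %2d: bank %2d (no shift), bank %2d (shifted)\n",
			tx, addr_plain % NUM_BANKS, addr_shifted % NUM_BANKS);
	}
	//Without the shift all 16 active threads hit the same bank (16-way conflict);
	//with the shift each thread hits a different bank.
	return 0;
}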

Any idea?

Thanks in advance,

Kwyjibo
