Atomic instructions on global and shared memory

Hi guys,

I am trying to use the atomicAdd instruction on shared memory to speed up my code, but it is having the opposite effect.

The version using atomic instructions on global memory is as follows:

__global__ void calcCentroidKernel( int *gpu_labels, int *gpu_nRegions, int *gpu_regionOff, int *gpu_regionSize, int *gpu_centroid, int *i, int pitch)
{
	int x = (blockIdx.x*blockDim.x)+threadIdx.x;
	int y = (blockIdx.y*blockDim.y)+threadIdx.y;
	int index = x+y*pitch;
	int j = 0;

	// Each root pixel (label equal to its own index) registers one region offset.
	if( gpu_labels[index] == index)
	{
		atomicAdd( i, 1);
		atomicAdd( gpu_regionOff + (*i), index);
	}

	// Find which region this pixel's label belongs to.
	for( j = 0; j < *gpu_nRegions; j++)
	{
		if( gpu_labels[index] == gpu_regionOff[j])
			break;
	}

	// Accumulate coordinates and pixel count for the region.
	if( gpu_labels[index] != -1)
	{
		atomicAdd( gpu_centroid+(2*j), x);
		atomicAdd( gpu_centroid+(2*j)+1, y);
		atomicAdd( gpu_regionSize+j, 1);
	}
}

The version using atomic instructions on shared memory is as follows:

__global__ void calcCentroidSharedKernel( int *gpu_labels, int *gpu_nRegions, int *gpu_regionOff, int *gpu_regionsSize, int *gpu_centroid, int *i, int pitch)
{
	extern __shared__ int sMem[];

	int x = (blockIdx.x*blockDim.x)+threadIdx.x;
	int y = (blockIdx.y*blockDim.y)+threadIdx.y;
	int index = x+y*pitch;
	int j = 0;
	int shSize_Offset = 0;
	int shCentroid_Offset = *gpu_nRegions;

	if( gpu_labels[index] == index)
	{
		atomicAdd( i, 1);
		atomicAdd( gpu_regionOff + (*i), index);
	}

	for( j = 0; j < *gpu_nRegions; j++)
	{
		if( gpu_labels[index] == gpu_regionOff[j])
			break;
	}

	// Accumulate into shared memory instead of global memory.
	if( gpu_labels[index] != -1)
	{
		atomicAdd( sMem+shCentroid_Offset+(2*j), x);
		atomicAdd( sMem+shCentroid_Offset+(2*j)+1, y);
		atomicAdd( sMem+shSize_Offset+j, 1);
	}

	__syncthreads();

	// Every thread then adds the shared partial sums to global memory.
	atomicAdd( gpu_centroid+(2*j), sMem[shCentroid_Offset+(2*j)]);
	atomicAdd( gpu_centroid+(2*j)+1, sMem[shCentroid_Offset+(2*j)+1]);
	atomicAdd( gpu_regionsSize+j, sMem[shSize_Offset+j]);
}

I am unable to understand why the timing of my code increased after switching to shared memory. It looks like the shared memory usage is just overhead, but considering that shared memory is accessed more than 200 times, it should have provided a considerable timing improvement…


Your shared memory version is doing the shared memory atomic adds in addition to those done by the other version, since after __syncthreads() every thread still executes the global atomicAdds. So why would you expect it to run faster?

BTW: Did you compare results of both versions? The shared memory version lacks initialization of the shared memory, so you’d be extremely lucky if they produced the same results.
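For reference, a minimal sketch of how a block could cooperatively zero its dynamic shared memory before any thread does atomics on it (zeroSharedSketch and nInts are illustrative names, not from the code above):

__global__ void zeroSharedSketch( int nInts)               // nInts: number of shared ints in use
{
	extern __shared__ int sMem[];

	int tid      = threadIdx.y*blockDim.x + threadIdx.x;   // linear thread id within the block
	int nThreads = blockDim.x*blockDim.y;

	// Each thread clears a strided subset of the shared array.
	for( int k = tid; k < nInts; k += nThreads)
		sMem[k] = 0;

	__syncthreads();   // all of sMem is zeroed before any atomics start
}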

Shared memory should perform faster than global for the following reasons…

  1. Shared memory access is faster than global memory access.
  2. Assume I have around 1000 additions to be done. In the first case, all those additions are done on global memory (roughly 1000 global accesses). In the second case, the additions are first done on shared memory, and then, depending on the grid size ((16,12) in this case), there are far fewer global atomic additions, so theoretically the second should be faster (see the sketch below).
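Here is a compact sketch of that accumulate-then-flush pattern (blockReduceSketch, val, and gpu_total are illustrative names, not from my kernels): every thread adds into shared memory, and only one thread per block issues the global atomic.

__global__ void blockReduceSketch( const int *val, int *gpu_total, int n)
{
	extern __shared__ int sSum[];

	int tid = threadIdx.x;
	int idx = blockIdx.x*blockDim.x + tid;

	if( tid == 0)
		sSum[0] = 0;
	__syncthreads();

	if( idx < n)
		atomicAdd( sSum, val[idx]);        // many cheap shared-memory atomics per block
	__syncthreads();

	if( tid == 0)
		atomicAdd( gpu_total, sSum[0]);    // a single global atomic per block
}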

And yes, you are right, I didn’t initialize the shared memory (though I only measured timing and didn’t check the output).

I think I messed up after __syncthreads()…

The code after it should be this…

__syncthreads();

if( index == 0)
{
	atomicAdd( gpu_centroid+(2*j), sMem[shCentroid_Offset+(2*j)]);
	atomicAdd( gpu_centroid+(2*j)+1, sMem[shCentroid_Offset+(2*j)+1]);
	atomicAdd( gpu_regionsSize+j, sMem[shSize_Offset+j]);
}

I am able to get a performance improvement now. Could you point out any optimizations that can be done in this code (the timing I am getting now is still too high)? Is there any other approach that I can follow?

You probably want to make this

__syncthreads();

if( (threadIdx.x == 0) && (threadIdx.y == 0))
{
	atomicAdd( gpu_centroid+(2*j), sMem[shCentroid_Offset+(2*j)]);
	atomicAdd( gpu_centroid+(2*j)+1, sMem[shCentroid_Offset+(2*j)+1]);
	atomicAdd( gpu_regionsSize+j, sMem[shSize_Offset+j]);
}

in order to get correct results.

I don’t understand what these kernels are doing, so I can’t suggest improvements. However, the scattered atomicAdds are probably the worst case for memory throughput, and anything you could do to at least partly coalesce them would certainly help.
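On newer GPUs (sm_30+ with CUDA 9 or later), one way to partly aggregate them is a warp-level reduction before the atomic, so only one atomicAdd per warp reaches memory. A sketch, assuming all 32 lanes of the warp are active and target the same counter (in your kernels, threads of one warp may target different regions, so this only applies where that assumption holds):

__device__ void warpAggregatedAdd( int *counter, int myVal)
{
	unsigned mask = 0xffffffffu;                        // full warp assumed active

	// Tree reduction within the warp.
	for( int offset = 16; offset > 0; offset >>= 1)
		myVal += __shfl_down_sync( mask, myVal, offset);

	// Lane 0 now holds the warp's sum: one atomic instead of 32.
	if( (threadIdx.x & 31) == 0)
		atomicAdd( counter, myVal);
}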


These kernels are calculating the size and centroid of arbitrary object shapes in an image. The region offset is the index value of the top-left pixel of the shape. I know the atomicAdds are all scattered across the object shape; I couldn’t find a way to coalesce them.

After the modifications, this is the code I came up with:

__global__ void calcCentroidSharedKernel( int *gpu_labels, int *gpu_nRegions, int *gpu_regionOff, int *gpu_regionsSize, int *gpu_centroid, int *i, int pitch)
{
	extern __shared__ int sMem[];

	int x = (blockIdx.x*blockDim.x)+threadIdx.x;
	int y = (blockIdx.y*blockDim.y)+threadIdx.y;
	int index = x+y*pitch;
	int j = 0;
	int shSize_Offset = 0;
	int shCentroid_Offset = *gpu_nRegions-1;                   // This line is modified.

	if( (index >= 0) && (index < (3*(*gpu_nRegions))))         // These are the added lines of code.
		sMem[index] = 0;

	if( gpu_labels[index] == index)
	{
		atomicAdd( i, 1);
		atomicAdd( gpu_regionOff + (*i), index);
	}

	for( j = 0; j < *gpu_nRegions; j++)
	{
		if( gpu_labels[index] == gpu_regionOff[j])
			break;
	}

	if( gpu_labels[index] != -1)
	{
		atomicAdd( sMem+shCentroid_Offset+(2*j), x);
		atomicAdd( sMem+shCentroid_Offset+(2*j)+1, y);          /******** Program crashing code line **********/
		atomicAdd( sMem+shSize_Offset+j, 1);
	}

	__syncthreads();

	if( (threadIdx.x == 0) && (threadIdx.y == 0))               // This is the condition modified.
	{
		atomicAdd( gpu_centroid+(2*j), sMem[shCentroid_Offset+(2*j)]);
		atomicAdd( gpu_centroid+(2*j)+1, sMem[shCentroid_Offset+(2*j)+1]);   /******** Program crashing code line **********/
		atomicAdd( gpu_regionsSize+j, sMem[shSize_Offset+j]);
	}
}

While I was testing this with a test video, the program abruptly crashed. I found that if I comment out the marked atomic instructions, the video runs fine.

EDIT: I forgot to ask: can you point out the reason for the crash…


What data are you passing to the kernel? If *gpu_nRegions == 0, shCentroid_Offset is set to -1 and the loop with ‘j’ as the index never executes (so the value of ‘j’ stays zero). That would cause the first atomicAdd in each of the blocks at the bottom to use an out-of-bounds address (not the second/middle lines as you’ve marked with the comments).
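A defensive guard at the top of the kernel would rule that failure mode out entirely (a sketch; the early return is my addition, the rest of the kernel would stay unchanged):

// Bail out early when there is nothing to accumulate, so the -1 offset
// and the skipped loop can never combine into an out-of-bounds address.
if( *gpu_nRegions == 0)
	return;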


If *gpu_nRegions == 0, then the kernel is never called. The parameters are as follows:

@ gpu_labels = Global memory that contains -1 for each background pixel and some positive integer wherever there is a shape. All pixels within a shape have the same value (I had to do connected component labelling before this), and that value is the index of the top-left pixel of the shape.

@ gpu_nRegions = The total number of shapes in the image, i.e. the total number of circles, squares, etc…

@ gpu_regionOff = The top-left pixel of each shape.

@ gpu_regionsSize = The size of each shape, to be computed.

@ gpu_centroid = The centroid of each shape, to be computed.

@ i = Just a counter.

@ pitch = The image width.
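To make the labelling convention concrete, here is a hypothetical 4x4 labels image under that scheme (purely illustrative, not taken from my data):

// Hypothetical 4x4 labels image (pitch = 4): one 2x2 square whose top-left
// pixel has linear index 5, so every pixel of the shape carries label 5.
int labels[16] = {
	-1, -1, -1, -1,
	-1,  5,  5, -1,
	-1,  5,  5, -1,
	-1, -1, -1, -1
};
// For this image: nRegions == 1, gpu_regionOff[0] == 5, the region size is 4,
// and the centroid sums are x: 1+2+1+2 = 6 and y: 1+1+2+2 = 6.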

Hi guys,

I came up with some kind of workaround to the crashing problem…

Initially the code used to call the kernel was this:

int *i;
cudaMalloc( (void **)&i, sizeof(int));
cudaMemset( i, -1, sizeof(int));

int shSize = nRegions*3*sizeof(int);

dim3 block( 15, 16, 1);
dim3 grid( imageW / block.x, imageH / block.y, 1);

if( shSize != 0)
{
	calcCentroidSharedKernel<<<grid,block,shSize>>>( gpu_labels, gpu_nRegions, gpu_regionOff, gpu_regionSize, gpu_centroid, i, imageW);
}

I modified the code to:

int *i;
cudaMalloc( (void **)&i, sizeof(int));
cudaMemset( i, -1, sizeof(int));

int shSize = nRegions*4*sizeof(int);         // This is the line modified.

dim3 block( 15, 16, 1);
dim3 grid( imageW / block.x, imageH / block.y, 1);

if( shSize != 0)
{
	calcCentroidSharedKernel<<<grid,block,shSize>>>( gpu_labels, gpu_nRegions, gpu_regionOff, gpu_regionSize, gpu_centroid, i, imageW);
}

Also, inside the kernel, shCentroid_Offset should be *gpu_nRegions, not *gpu_nRegions-1.

The code now works fine, but some shared memory gets wasted this way. I am just curious why it didn’t work previously… (somehow I think a shared memory size of 3*nRegions was not enough: with shCentroid_Offset = *gpu_nRegions-1, a thread whose label matches no region offset falls through the loop with j == *gpu_nRegions, so sMem[shCentroid_Offset+(2*j)+1] indexes element 3*nRegions, one past the end of the original allocation).
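For reference, this is the indexing a 3*nRegions allocation supports, with a guard added so the fall-through case cannot write past the arrays (the j < *gpu_nRegions check is an addition for illustration, not in my original kernel):

// Shared layout for nRegions regions in 3*nRegions ints:
//   sMem[0 .. nRegions-1]            region sizes
//   sMem[nRegions .. 3*nRegions-1]   (x, y) centroid sums, 2 ints per region
int shSize_Offset     = 0;
int shCentroid_Offset = *gpu_nRegions;

// Only accumulate when the label matched a known region, so j == *gpu_nRegions
// can never index one element past the end of the shared arrays.
if( (gpu_labels[index] != -1) && (j < *gpu_nRegions))
{
	atomicAdd( sMem+shCentroid_Offset+(2*j), x);
	atomicAdd( sMem+shCentroid_Offset+(2*j)+1, y);
	atomicAdd( sMem+shSize_Offset+j, 1);
}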


Is it really faster than global? Thank you!!
Also, your discussion is very useful for later learners like me!!!