Thread memory concurrency within the same block?

Hi,

I have a simple question. Say every thread in a block increments the same int. Is there any way to ensure that each thread's increment completes without being disturbed by the others, i.e. that the memory accesses are synchronized and locked?

I have a kernel doing this:

[codebox]__shared__ int pixels_that_change;

if(diffB > ruido && diffG > ruido && diffR > ruido)pixels_that_change++;

__syncthreads();

//Do something depending on the pixels_that_change value.[/codebox]

Can I be sure that every pixels_that_change++ is carried out properly and not clobbered by another thread?

Thanks!

Not using that code, no. If you want that counter increment to work correctly, you will need to use an atomic function. Shared memory atomic operations are only supported on compute capability 1.2 or greater devices.

Oh thanks, I didn't know about atomic functions. Do you have an example?

I was solving that in this way:

[codebox]__shared__ bool change[256];

change[ty*blockDim.x + tx]=false;

if(diffB > ruido || diffG > ruido || diffR > ruido)change[ty*blockDim.x + tx]=true;

__syncthreads();

if(tx == 0)

{

	for(int i=0;i < 256 ;i++)

	{

		if(change[i]==true)pixels_cambian++;

	

	}

}

__syncthreads();

//Do something…[/codebox]

But I think it is quite inefficient.

Well, there are two ways of doing it. The easiest is to replace ‘pixels_that_change++’ with ‘atomicAdd(&pixels_that_change, 1)’ in your first code sample and make sure your arch is sm_12 or higher. There will be a conflict every cycle for every thread, so it will be slow, but it will work.
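
In your first sample the change is just this (a minimal sketch, assuming pixels_that_change is a __shared__ int that one thread zeroes before use):

[codebox]__shared__ int pixels_that_change;

if(ty == 0 && tx == 0) pixels_that_change = 0; // one thread initializes the block counter
__syncthreads();

if(diffB > ruido && diffG > ruido && diffR > ruido)
    atomicAdd(&pixels_that_change, 1); // conflicting increments are serialized instead of lost

__syncthreads();
// pixels_that_change now holds the per-block count[/codebox]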

The other MUCH faster option is similar to your second code block, just using a proper reduction algorithm with ints instead. Take a look at the reduction sample in the SDK. Also, I’m not entirely sure how bool arrays are stored in shared memory on the GPU, but if each element is less than 32 bits you may want to use an int array anyway to reduce bank conflicts.

Of course, if you have a Fermi-capable card and you use bools, I can imagine you could pull off a clever trick using __ballot() and __popc()…
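
Roughly what I have in mind, sketched (untested; assumes compute capability 2.0 for __ballot, and the standard warp decomposition over the linear thread index):

[codebox]bool changed = (diffB > ruido && diffG > ruido && diffR > ruido);

// __ballot packs one predicate bit per thread of the warp, __popc counts the set bits
unsigned int warp_mask = __ballot(changed);
int warp_count = __popc(warp_mask);

// one thread per warp adds its warp's count to the shared block counter
int tid = ty * blockDim.x + tx;   // linear thread index within the block
if((tid & 31) == 0)
    atomicAdd(&pixels_that_change, warp_count);[/codebox]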

Easiest is:

pixels_that_change[threadIdx.x] = 0; // initially

sum = 0;

........

........

pixels_that_change[threadIdx.x]++;

....

............

__syncthreads(); // make sure every thread has written its count before summing

if (threadIdx.x == 0)

{

   for(int i=0; i<blockDim.x; i++)

   {

	 sum += pixels_that_change[i]; 

   }

   // store sum wherever you want

}

Yes, that was my last approach. I'll take a look and see if it's fast enough; since I have many blocks I think it can be efficient enough.

Thanks!

If performance is an issue, take a look at the reduction sample in the SDK as I mentioned:

http://developer.download.nvidia.com/compu…c/reduction.pdf

The specific optimization being:

[codebox]__shared__ int pixels_that_change[256]; // one slot per thread (256 threads per block assumed)

int tid = threadIdx.x;

pixels_that_change[tid] = 0; // initially

pixels_that_change[tid]++; // in the real kernel, increment only when the pixel changed

__syncthreads();

// do tree reduction in shared mem (assumes blockDim.x is a power of two)

for(unsigned int s=blockDim.x/2; s>0; s>>=1)

{

    if (tid < s) 

    {

        pixels_that_change[tid] += pixels_that_change[tid + s];

    }

    __syncthreads();

}

// write result for this block to global mem

if (tid == 0) result[blockIdx.x] = pixels_that_change[0]; // result: one slot per block in global memory

[/codebox]

There are even further optimized versions, but just not having all but one thread idle over your 256 or so elements will already make a huge difference.
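
For reference, this is roughly how the pieces fit together in one kernel. It's only a sketch: the uchar3 frame layout, the block_counts output array, the 16x16 block shape, and image dimensions being exact multiples of the block size are all my assumptions, not necessarily your setup:

[codebox]__global__ void count_changed_pixels(const uchar3 *prev, const uchar3 *curr,
                                     int *block_counts, int width, int ruido)
{
    __shared__ int flags[256];          // one flag per thread, 16x16 block assumed

    int tx = threadIdx.x, ty = threadIdx.y;
    int tid = ty * blockDim.x + tx;     // linear thread index within the block
    int x = blockIdx.x * blockDim.x + tx;
    int y = blockIdx.y * blockDim.y + ty;
    int idx = y * width + x;

    // per-pixel change test (same predicate as in the thread)
    int diffB = abs((int)curr[idx].x - (int)prev[idx].x);
    int diffG = abs((int)curr[idx].y - (int)prev[idx].y);
    int diffR = abs((int)curr[idx].z - (int)prev[idx].z);
    flags[tid] = (diffB > ruido && diffG > ruido && diffR > ruido) ? 1 : 0;
    __syncthreads();

    // tree reduction in shared memory (block size must be a power of two)
    for(unsigned int s = (blockDim.x * blockDim.y) / 2; s > 0; s >>= 1)
    {
        if(tid < s)
            flags[tid] += flags[tid + s];
        __syncthreads();
    }

    // thread 0 writes this block's count to global memory
    if(tid == 0)
        block_counts[blockIdx.y * gridDim.x + blockIdx.x] = flags[0];
}[/codebox]

Each block then leaves its own count in block_counts, which you can sum on the host or with a second small kernel.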
