Thread memory concurrency within the same block?

Hi,

I have a simple question. Say every thread in a block increments the same int. Is there any way to ensure that each thread's increment completes without being disturbed by the others, i.e. that the memory accesses are synchronized and locked?

I have a kernel doing this:

[codebox]__shared__ int pixels_that_change;

if(diffB > ruido && diffG > ruido && diffR > ruido)pixels_that_change++;

__syncthreads();

//Do something depending on the pixels_that_change value.[/codebox]

Can I be sure that every pixels_that_change++ is carried out properly and not clobbered by another thread?

Thanks!

Not using that code, no. If you want that counter increment to work correctly, you will need to use an atomic function. Shared memory atomic operations are only supported on compute capability 1.2 or greater devices.

Oh thanks, I didn't know about atomic functions. Do you have an example?

I was solving that in this way:

[codebox]__shared__ bool change[256];

change[ty*blockDim.x + tx]=false;

if(diffB > ruido || diffG > ruido || diffR > ruido)change[ty*blockDim.x + tx]=true;

__syncthreads();

if(tx == 0)

{

	for(int i=0;i < 256 ;i++)

	{

		if(change[i]==true)pixels_cambian++;

	

	}

}

__syncthreads();

//Do something…[/codebox]

But I think it is quite inefficient.

Well, there are two ways of doing it. The easiest is to replace ‘pixels_that_change++’ with ‘atomicAdd(&pixels_that_change, 1)’ in your first code sample and make sure your arch is sm_12 or higher. There will be a conflict every cycle for every thread, so it will be slow, but it will work.
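
In your first sample the change is just this (a minimal sketch, assuming pixels_that_change is a __shared__ int that one thread zeroes before use):

[codebox]__shared__ int pixels_that_change;

if(ty == 0 && tx == 0) pixels_that_change = 0; // one thread initializes the block counter
__syncthreads();

if(diffB > ruido && diffG > ruido && diffR > ruido)
    atomicAdd(&pixels_that_change, 1); // conflicting increments are serialized instead of lost

__syncthreads();
// pixels_that_change now holds the per-block count[/codebox]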

The other MUCH faster option is similar to your second code block, just using a proper reduction algorithm with ints instead. Take a look at the reduction sample in the SDK. Also, I’m not entirely sure how bool arrays are stored in shared memory on the GPU, but if each element is less than 32 bits you may want to use an int array anyway to reduce bank conflicts.

Of course, if you have a Fermi-capable card and you use bools, I can imagine you could pull off a clever trick using __ballot() and __popc()…
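
Roughly what I have in mind, sketched (untested; assumes compute capability 2.0 for __ballot, and the standard warp decomposition over the linear thread index):

[codebox]bool changed = (diffB > ruido && diffG > ruido && diffR > ruido);

// __ballot packs one predicate bit per thread of the warp, __popc counts the set bits
unsigned int warp_mask = __ballot(changed);
int warp_count = __popc(warp_mask);

// one thread per warp adds its warp's count to the shared block counter
int tid = ty * blockDim.x + tx;   // linear thread index within the block
if((tid & 31) == 0)
    atomicAdd(&pixels_that_change, warp_count);[/codebox]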

Easiest is:

pixels_that_change[threadIdx.x] = 0; // initially

sum = 0;

........

........

pixels_that_change[threadIdx.x]++;

....

............

__syncthreads(); // make sure every thread has written its count before summing

if (threadIdx.x == 0)

{

   for(int i=0; i<blockDim.x; i++)

   {

	 sum += pixels_that_change[i]; 

   }

   // store sum wherever you want

}

Yes, that was my last approach. I'll take a look and see if it's fast enough; since I have many blocks I think it can be efficient enough.

Thanks!

If performance is an issue, take a look at the reduction sample in the SDK as I mentioned:

http://developer.download.nvidia.com/compu…c/reduction.pdf

The specific optimization being:

[codebox]__shared__ int pixels_that_change[256]; // one slot per thread (256 threads per block assumed)

int tid = threadIdx.x;

pixels_that_change[tid] = 0; // initially

pixels_that_change[tid]++; // in the real kernel, increment only when the pixel changed

__syncthreads();

// do tree reduction in shared mem (assumes blockDim.x is a power of two)

for(unsigned int s=blockDim.x/2; s>0; s>>=1)

{

    if (tid < s) 

    {

        pixels_that_change[tid] += pixels_that_change[tid + s];

    }

    __syncthreads();

}

// write result for this block to global mem

if (tid == 0) result[blockIdx.x] = pixels_that_change[0]; // result: one slot per block in global memory

[/codebox]

There are even further optimized versions, but just not having all but one thread idle over your 256 or so elements will already make a huge difference.
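
For reference, this is roughly how the pieces fit together in one kernel. It's only a sketch: the uchar3 frame layout, the block_counts output array, the 16x16 block shape, and image dimensions being exact multiples of the block size are all my assumptions, not necessarily your setup:

[codebox]__global__ void count_changed_pixels(const uchar3 *prev, const uchar3 *curr,
                                     int *block_counts, int width, int ruido)
{
    __shared__ int flags[256];          // one flag per thread, 16x16 block assumed

    int tx = threadIdx.x, ty = threadIdx.y;
    int tid = ty * blockDim.x + tx;     // linear thread index within the block
    int x = blockIdx.x * blockDim.x + tx;
    int y = blockIdx.y * blockDim.y + ty;
    int idx = y * width + x;

    // per-pixel change test (same predicate as in the thread)
    int diffB = abs((int)curr[idx].x - (int)prev[idx].x);
    int diffG = abs((int)curr[idx].y - (int)prev[idx].y);
    int diffR = abs((int)curr[idx].z - (int)prev[idx].z);
    flags[tid] = (diffB > ruido && diffG > ruido && diffR > ruido) ? 1 : 0;
    __syncthreads();

    // tree reduction in shared memory (block size must be a power of two)
    for(unsigned int s = (blockDim.x * blockDim.y) / 2; s > 0; s >>= 1)
    {
        if(tid < s)
            flags[tid] += flags[tid + s];
        __syncthreads();
    }

    // thread 0 writes this block's count to global memory
    if(tid == 0)
        block_counts[blockIdx.y * gridDim.x + blockIdx.x] = flags[0];
}[/codebox]

Each block then leaves its own count in block_counts, which you can sum on the host or with a second small kernel.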
