What's the best way to define a counter that can be accessed by all threads?

In my program, I need a counter on the GPU that can be accessed by all threads.
The purpose of this counter is to find out how many threads have finished their designated task. The logic would work like this pseudocode:
__global__ void test_function()
{
	int counter = total_threads_number;
	while (counter > 0)
	{
		if (not processed)
			do something for this thread id;
		if (criterion is met)
		{
			set this thread as 'processed';
			counter = counter - 1;
		}
		__threadfence();
	}
}

What is the best way to define such a counter? My understanding is that it must live in global memory, so I think I need to cudaMalloc an array with just one element to serve as the counter. This may work, but when all threads try to access this counter, will the thread execution be serialized? Is there a better way to define such a counter?

Thank you,

atomicInc(), and yes, it works on global memory.

Note that atomics work only on compute capability 1.1 and above for global memory, and 1.2 and above for shared memory.
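
For concreteness, here is a minimal sketch of that single global counter (names like d_counter, countingKernel, and N_THREADS are just placeholders, and I've used atomicAdd() rather than atomicInc() since it simply adds 1 with no wrap-around; compile with -arch=sm_11 or higher so the atomic is available):

#include <cstdio>

__global__ void countingKernel(int *counter)
{
	// ... per-thread work would go here ...
	atomicAdd(counter, 1);	// each thread bumps the counter by one
}

int main()
{
	const int N_THREADS = 256;
	int *d_counter = 0;
	int h_counter = 0;

	cudaMalloc((void**)&d_counter, sizeof(int));	// the one-element "array" used as the counter
	cudaMemset(d_counter, 0, sizeof(int));

	countingKernel<<< 1, N_THREADS >>>(d_counter);
	cudaThreadSynchronize();

	cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
	printf("counter = %d\n", h_counter);	// expect 256

	cudaFree(d_counter);
	return 0;
}

As for serialization: atomic operations on the same address are serialized by the hardware, but one atomic per thread like this is usually not the bottleneck.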

atomicInc() on shared memory, followed by __syncthreads(), followed by if (threadIdx.x == 0) atomicAdd() into global memory.
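
In other words, something along these lines (an untested sketch; countFinished, globalCounter, and the finished flag are only illustrative, and I've used atomicAdd() for the shared counter, which like atomicInc() needs compute 1.2 or above on shared memory):

__global__ void countFinished(unsigned int *globalCounter)
{
	__shared__ unsigned int blockCount;

	if (threadIdx.x == 0)
		blockCount = 0;			// reset the per-block counter
	__syncthreads();

	bool finished = true;			// placeholder for the real "criterion is met" test

	if (finished)
		atomicAdd(&blockCount, 1u);	// cheap atomic in shared memory

	__syncthreads();			// wait until every thread in the block has contributed

	if (threadIdx.x == 0)
		atomicAdd(globalCounter, blockCount);	// one global atomic per block
}

This way you pay one global-memory atomic per block instead of one per thread.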

I wrote a very simple program to test the behavior of atomicInc() versus directly using ++. The code is as follows:

The host side:

	cuPrintfKernel<<< 1, 4 >>>(d_counter);
	cudaThreadSynchronize();
	cudaPrintfDisplay(stdout, true);

The kernel is:

__global__ void cuPrintfKernel(int *counter)
{
	//atomicInc((unsigned int*)&counter[0], 1); // what will happen? serialized automatically?
	counter[0]++;
	__threadfence_block();
	cuPrintf("counter: %d ", counter[0]);
}

If I use atomicInc(…), this is the output:

[0, 0]: counter: 0 [0, 1]: counter: 0 [0, 2]: counter: 0 [0, 3]: counter: 0;

If I use counter[0]++, this is the output:

[0, 0]: counter: 1 [0, 1]: counter: 1 [0, 2]: counter: 1 [0, 3]: counter: 1;

Neither of these gives the right result; in this case I expect counter = 4. What should I do to get it right?

Thank you,

I solved the problem. I had misunderstood atomicInc(): atomicInc(address, val) wraps back to 0 once the old value reaches val, so with val = 1 my counter kept resetting instead of counting up. I should use atomicAdd() instead of atomicInc().
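
For reference, the corrected kernel looks roughly like this (still a sketch, using cuPrintf as before):

__global__ void cuPrintfKernel(int *counter)
{
	// atomicAdd returns the value *before* the increment, so old + 1 is this
	// thread's own position in the count, with no racy re-read of counter[0].
	int old = atomicAdd(counter, 1);
	cuPrintf("counter: %d ", old + 1);
}

With 4 threads this should print 1 through 4 in some order, and counter[0] ends up at 4.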

I searched the previous posts and saw that this approach is not very efficient, though.