Multithreading Counter


I hope someone could help me:

I have one block with many threads within it.

Each of these threads makes a computation. But not each of the result is valid and has to be returned.

So my idea was to create a counter that counts, how many threads have a valid result and each thread writes back its return value if it is valid.

__global__ static void lineTest(int *ret) {

	const int idx = threadIdx.x;

	__shared__ volatile unsigned int  returnIndex;

//maybe some init stuff that only first thread has 2 do

	if (threadIdx.x == 0) {

	   returnIndex == 1;



//some stuff, that not everybody writes a result

	bool isValid = ((idx % 3) == 1) && ((idx % 11) == 2);

	if (isValid) {

		ret[++returnIndex] = idx % 7;



//write back size

	 if (threadIdx.x == 0)


		 ret[0] = returnIndex;



In deviceemu mode it works as expected, the array is filled and on index 0 there is the size of the array.

Any suggestions, why it does not work on GPU?


This line is a thread race… you have multiple threads simultaneously accessing and updating returnAccess.

if (isValid) {

		ret[++returnIndex] = idx % 7;


Solutions are many… it’s a common problem.

If you have only rare values to write, then a global atomic increment will work nicely.

But if you have many successful values to write, it can be better to write ALL values, including failures, into predetermined slots, then run a compaction kernel to snip out the unused values. This approach has tons of variants too, ranging from custom coded ones to simple but generic library call (like CUDPP’s).

Thx for the fast reply.

But isn’t ++returnIndex atomic? So the next thread will have an other returnIndex value to increase and to access the array?

But writing every value sounds better for me :) as longer as I think about it.

No, each thread does not have its own returnIndex. You declared it as a shared variable. This is correct, though, since per-thread indexing would be meaningless.

If you do write every value, then it’s easy since every thread knows exactly where it will write. It also means perfect coalescing, even on 1.0 hardware.

No, although that is one line of C, it compiles down to three instructions:

  1. Read returnIndex from shared memory to a register.

(Since you declared returnIndex as volatile, the compiler is not allowed to reuse returnIndex from a previous read)

  1. Increment that register by 1.

  2. Write that register back to returnIndex location in shared memory.

Just for future reference, compute capability 1.2 devices and higher (GTX 200 cards and some others) have shared memory atomic operations which can do this in a thread-safe way.