GPU Thread Synchronisation

I am using a Monte Carlo approach to solve a particular problem, and the difficulty is writing the output to a buffer.
Each thread handles only one tiny part of the problem, and most threads terminate early because there is nothing to write,
i.e. the output generated by that combination is invalid. I want to collect all the valid results in a single buffer.
The simplest layout I can think of is a struct like this:

struct buffer {
    int numberOfResults;
    int values[2048];
};

When a thread reaches the output stage, it reads numberOfResults:

location = buffer.numberOfResults;

Then it increments the counter in the buffer

and writes its output to the buffer:
buffer.values[location] = output;

I know this can be done with normal threads on the CPU,
with a lock held between reading numberOfResults and incrementing it.

I don't know how this can be done with GPU threads. It's a very simple queue.


It can be done with atomic instructions. Something like:
int oldIdx = atomicAdd(&buffer.numberOfResults, 1);
buffer.values[oldIdx] = …;

This only works on compute capability 1.1 or above, though (the minimum for 32-bit integer atomics on global memory).
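To make the idea above concrete, here is a minimal sketch of the append as a complete kernel plus a host wrapper. The names ResultBuffer, isValid, appendValid and collectValid are made up for illustration, and the "multiple of 7" test just stands in for whatever your real validity check is. Error checking on the CUDA calls is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define MAX_RESULTS 2048

// Mirrors the struct from the question: a counter plus fixed-size storage.
struct ResultBuffer {
    int numberOfResults;
    int values[MAX_RESULTS];
};

// Hypothetical validity test standing in for the real Monte Carlo check:
// a candidate counts as "valid" when it is a multiple of 7.
__device__ bool isValid(int sample) { return sample % 7 == 0; }

__global__ void appendValid(ResultBuffer *buffer, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n || !isValid(tid)) return;    // nothing to write, exit early

    // atomicAdd returns the value held before the increment, so every
    // writer receives a unique slot index without any lock.
    int slot = atomicAdd(&buffer->numberOfResults, 1);
    if (slot < MAX_RESULTS)                   // guard against overflowing values[]
        buffer->values[slot] = tid;
}

// Host helper (hypothetical): runs the kernel over candidates 0..n-1 and
// returns the stored results. The order of the results is unspecified.
std::vector<int> collectValid(int n)
{
    if (n <= 0) return std::vector<int>();

    ResultBuffer *dBuffer = nullptr;
    cudaMalloc(&dBuffer, sizeof(ResultBuffer));
    cudaMemset(dBuffer, 0, sizeof(ResultBuffer));   // counter must start at zero

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    appendValid<<<blocks, threads>>>(dBuffer, n);

    ResultBuffer hBuffer = {};
    cudaMemcpy(&hBuffer, dBuffer, sizeof(ResultBuffer), cudaMemcpyDeviceToHost);
    cudaFree(dBuffer);

    // The counter keeps counting past the capacity, so clamp it.
    int count = hBuffer.numberOfResults < MAX_RESULTS ? hBuffer.numberOfResults
                                                      : MAX_RESULTS;
    return std::vector<int>(hBuffer.values, hBuffer.values + count);
}
```

Calling collectValid(1000) should give back the 143 multiples of 7 below 1000, in no particular order. Note the bounds check on slot: the counter itself can run past 2048, so compare it against the capacity on the host if you need to detect dropped results.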

Atomic operations, as the previous post stated, will work. However, note that your performance will be terrible if a majority of your threads write output, because they will all essentially be serialized while waiting to increment that one counter.

If a majority of your threads do write output, then you will want to use stream compaction to reduce the output in parallel.
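As a rough sketch of what stream compaction looks like in practice, the example below uses thrust::copy_if from the Thrust library that ships with the CUDA toolkit. The IsValid predicate and the compactValid helper are hypothetical names, and "even numbers are valid" is only a placeholder. In a real application each thread would first write its candidate output into its own slot, and the compaction pass would then filter out the invalid ones.

```cuda
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/sequence.h>
#include <vector>

// Hypothetical predicate standing in for the real validity check:
// a candidate counts as "valid" when it is even.
struct IsValid {
    __host__ __device__ bool operator()(int sample) const
    {
        return sample % 2 == 0;
    }
};

// Host helper (hypothetical): builds candidates 0..n-1 on the device,
// keeps only the valid ones, and returns them on the host.
std::vector<int> compactValid(int n)
{
    thrust::device_vector<int> candidates(n);
    thrust::sequence(candidates.begin(), candidates.end());   // 0, 1, ..., n-1

    // Worst case: every candidate is valid, so reserve n output slots.
    thrust::device_vector<int> results(n);

    // copy_if copies the elements that satisfy the predicate into a dense
    // output range in parallel and keeps their relative order.
    auto end = thrust::copy_if(candidates.begin(), candidates.end(),
                               results.begin(), IsValid());
    results.resize(end - results.begin());

    // Copy the compacted results back to the host.
    std::vector<int> host(results.size());
    thrust::copy(results.begin(), results.end(), host.begin());
    return host;
}
```

Unlike the atomic counter, this needs one slot per thread for the uncompacted output, but no thread ever waits on another, and the results come out in a deterministic order.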

What counts as a majority in this case? I have no idea. It could be as little as 10%. Only a benchmark will truly be able to tell.