Variable Number of Results

My problem is that I am launching about 10^6 threads; they each check for some condition, and only about 100 of them end up writing the value they computed.

A simple way would be to have every thread write a one or a zero into a 10^6-entry array, perform a scan on it to compute the output offset for each writing thread, and then do the writes to those offsets.
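For what it's worth, here is a minimal sketch of that scan approach, assuming the flags and values already live in device memory (the Thrust library is one way to get the scan; all names here are placeholders, not anything from your code):

#include <thrust/device_vector.h>
#include <thrust/scan.h>

// each flagged thread copies its value to the slot the scan assigned it
__global__ void scatter( const int *flags, const int *offsets,
                         const float *values, float *out, int n )
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if ( i < n && flags[i] )
		out[ offsets[i] ] = values[i];
}

// flags[i] holds the 1 or 0 thread i wrote; values[i] holds its result
void compact( thrust::device_vector<int> &flags,
              thrust::device_vector<float> &values,
              thrust::device_vector<float> &out )
{
	int n = flags.size();
	thrust::device_vector<int> offsets( n );
	thrust::exclusive_scan( flags.begin(), flags.end(), offsets.begin() );
	scatter<<< (n + 255) / 256, 256 >>>(
		thrust::raw_pointer_cast( flags.data() ),
		thrust::raw_pointer_cast( offsets.data() ),
		thrust::raw_pointer_cast( values.data() ),
		thrust::raw_pointer_cast( out.data() ), n );
}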

Another option, which I am not very sure of, is to pre-declare a buffer of about 1000 values, keep a pointer to the tail of the buffer, and use atomicAdd to increment it; the value just after the increment tells the thread where it has to put its value.

Let's say this is the code:

if ( some_condition ) {
	atomicAdd( pointer_address, 1 );
	buffer[*pointer_address] = value_computed;   // but which slot did *my* increment reserve?
}

Let's say thread 1000 starts executing this bit of code at clock cycle x and thread 2000 starts at cycle x+1.

Obviously thread 1000's atomicAdd would lock access to the memory location pointer_address until it finishes incrementing it.

Now, will the next atomic instruction, issued by thread 2000 at cycle x+1, lock the memory first, or will thread 1000 be able to read the global address (pointer_address) in time to compute the correct address to store its value into?

Is there some way of passing the value read by atomicAdd back to the thread?

Will this work, or is there another way to make it work?

Also, I would like to know whether the GPU has some form of in-order commit guarantee. Let's say I do a simple write to position A at clock cycle x; will a read of the same position A at cycle x+1 be deterministic, i.e. will it read the value written the previous cycle? In essence, my question is whether there can be race conditions even when instructions are issued at different clock cycles, or whether there is some form of re-order-buffer-like structure that prevents such RAW hazards. I guess this is kind of useless for the programmer to know, since blocks don't execute in any particular order on the SMs, but I am just curious.

P.S. I did try to search the forum, but all it said was that this is possible using global atomics; no implementation was discussed.

You want to use the return value of atomicAdd, which is the value at the address just before the increment:

if ( some_condition ) {
	unsigned int current_address = atomicAdd( pointer_address, 1 );   // returns the value *before* the increment
	buffer[current_address] = value_computed;
}
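For completeness, here is a sketch of the host side (assuming a float buffer of 1000 entries, as in your description; my_kernel stands in for whatever kernel contains the snippet above): the counter must be zeroed before launch, and its final value tells you how many results came back.

unsigned int *pointer_address;
float *buffer;
cudaMalloc( (void**)&pointer_address, sizeof(unsigned int) );
cudaMalloc( (void**)&buffer, 1000 * sizeof(float) );
cudaMemset( pointer_address, 0, sizeof(unsigned int) );   // tail starts at slot 0

my_kernel<<< grid, block >>>( pointer_address, buffer );

unsigned int num_results;
cudaMemcpy( &num_results, pointer_address, sizeof(unsigned int),
            cudaMemcpyDeviceToHost );                     // how many slots were claimed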

If your ratio really is 10^6 threads to 100 writes, then the performance of this method should be fine, since the number of threads colliding on the atomic op will be small.

If I understand it correctly, __threadfence (coming in CUDA 2.2) will give you the ordering guarantee you are asking about.
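The pattern it is intended for looks something like this (a sketch; the flag/result names are made up): the producing thread writes its data, fences, and only then raises a flag, so any thread that sees the flag set is guaranteed to see the data too.

__device__ int result;              // the data being published
__device__ volatile int flag = 0;   // readers poll this

__global__ void producer( int value )
{
	if ( threadIdx.x == 0 && blockIdx.x == 0 ) {
		result = value;    // 1. write the data
		__threadfence();   // 2. make the write visible device-wide
		flag = 1;          // 3. only then raise the ready flag
	}
}

A reader that polls flag and sees 1 is then guaranteed to read the up-to-date result; without the fence, the two writes could become visible to other threads in either order.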

Hey, thanks for the info. I rather foolishly missed the fact that the atomic functions return a value.

Let's say the writes number more than 100, say 1000; wouldn't initializing 10 such buffers speed things up, since there would be less contention on the counter?
And would some sort of hierarchical approach help, as in keeping a small buffer in shared memory per block, so that the global memory counter is touched only once per block?

Both of those suggestions are reasonable. See the histogram SDK projects for examples of programs that do exactly that.
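The hierarchical version looks roughly like this (a sketch, untested; the condition and value computation are placeholders): each thread appends into a shared-memory staging buffer behind a shared counter, then the block reserves one contiguous chunk of the global buffer with a single atomicAdd. Note that shared-memory atomics require compute capability 1.2 or later.

__global__ void collect( const float *in, int n,
                         unsigned int *global_count, float *out )
{
	__shared__ float        local_buf[256];   // staging buffer (assumes blockDim.x <= 256)
	__shared__ unsigned int local_count;      // tail of the staging buffer
	__shared__ unsigned int global_base;      // this block's slice of the global buffer

	if ( threadIdx.x == 0 )
		local_count = 0;
	__syncthreads();

	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if ( i < n && in[i] > 0.0f ) {            // stand-in for some_condition
		unsigned int slot = atomicAdd( &local_count, 1 );   // cheap shared-memory atomic
		local_buf[slot] = in[i];              // stand-in for value_computed
	}
	__syncthreads();

	// one global atomic per block instead of one per writing thread
	if ( threadIdx.x == 0 )
		global_base = atomicAdd( global_count, local_count );
	__syncthreads();

	// cooperatively flush the block's results to the global buffer
	for ( unsigned int j = threadIdx.x; j < local_count; j += blockDim.x )
		out[ global_base + j ] = local_buf[j];
}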