My problem is that I am launching about 10^6 threads; each one checks some condition, and about 100 of them write a value that they have computed.
A simple way would be to write 10^6 ones or zeros into an array, perform a scan on it to compute the offsets needed, then load these offsets back and do the writes.
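To make the scan idea concrete, here is a minimal sketch of the three steps (mark, scan, scatter). It assumes Thrust is available for the scan; some_condition, compute_value and compact_with_scan are hypothetical names made up for illustration, and the condition is chosen so that 100 of the 10^6 threads write.

```cuda
#include <vector>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/copy.h>

// Hypothetical condition and value, for illustration only.
__device__ bool some_condition(int i) { return i % 10000 == 0; }
__device__ float compute_value(int i) { return 2.0f * i; }

// Step 1: every thread records 1 if it wants to write, else 0.
__global__ void mark(int n, int *flags) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) flags[i] = some_condition(i) ? 1 : 0;
}

// Step 3: threads that passed the test write to their scanned offset.
__global__ void scatter(int n, const int *flags, const int *offsets, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i]) out[offsets[i]] = compute_value(i);
}

// Runs all three steps and returns the compacted values on the host.
// Assumes n > 0.
std::vector<float> compact_with_scan(int n) {
    thrust::device_vector<int> flags(n), offsets(n);
    int threads = 256, blocks = (n + threads - 1) / threads;
    mark<<<blocks, threads>>>(n, thrust::raw_pointer_cast(flags.data()));

    // Step 2: an exclusive scan turns the 0/1 flags into output positions.
    thrust::exclusive_scan(flags.begin(), flags.end(), offsets.begin());

    // Total number of writers = last offset + last flag.
    int last_offset = offsets[n - 1];
    int last_flag = flags[n - 1];
    int count = last_offset + last_flag;

    thrust::device_vector<float> out(count);
    scatter<<<blocks, threads>>>(n,
                                 thrust::raw_pointer_cast(flags.data()),
                                 thrust::raw_pointer_cast(offsets.data()),
                                 thrust::raw_pointer_cast(out.data()));
    cudaDeviceSynchronize();

    std::vector<float> host(count);
    thrust::copy(out.begin(), out.end(), host.begin());
    return host;
}
```

With this approach the output keeps the order of the thread indices, at the cost of two extra arrays of 10^6 ints and an extra pass over them.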
Another option, which I am not very sure of, is to pre-declare a buffer of about 1000 values and keep a pointer to the tail of the buffer. Each writing thread would use an atomic add to increment it and then read the value just after the increment, which tells the thread where it has to put its value.
Let's say this is the code:
if ( some_condition ) {
    atomicAdd( pointer_address, 1 );            // increment the tail counter
    buffer[*pointer_address] = value_computed;  // re-read the counter to pick a slot
}
Let's say thread 1000 starts executing this bit of code at clock cycle x and thread 2000 starts at cycle x+1.
Obviously thread 1000's atomic add would lock access to the memory location pointer_address until it finishes incrementing it.
Now, will the next atomic instruction, issued by thread 2000 at cycle x+1, lock the memory first, or will thread 1000 be able to read the global address (pointer_address) in time to compute the correct address to store its value into memory?
Is there some way of passing the value read by atomicAdd back to the thread?
Will this work, or is there another way to make this work?
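For concreteness, here is a minimal sketch of the atomic version I am hoping for. It relies on atomicAdd returning the value the counter held before the increment, which is how the CUDA programming guide documents it, so each thread would get its own slot without reading the counter a second time. The names some_condition, compute_value, append_kernel and append_with_atomics are hypothetical, and the capacity of 1000 is just the buffer size mentioned above.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int BUFFER_CAPACITY = 1000;  // the ~1000-entry buffer described above

// Hypothetical condition and value, for illustration only.
__device__ bool some_condition(int i) { return i % 10000 == 0; }
__device__ float compute_value(int i) { return 2.0f * i; }

__global__ void append_kernel(int n, int *tail, float *buffer) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && some_condition(i)) {
        // atomicAdd returns the old value of *tail, so no two threads
        // can be handed the same slot.
        int slot = atomicAdd(tail, 1);
        // Guard against more writers than the buffer can hold.
        if (slot < BUFFER_CAPACITY) buffer[slot] = compute_value(i);
    }
}

// Launches the kernel and returns the appended values on the host.
std::vector<float> append_with_atomics(int n) {
    int *tail = nullptr;
    float *buffer = nullptr;
    cudaMalloc(&tail, sizeof(int));
    cudaMalloc(&buffer, BUFFER_CAPACITY * sizeof(float));
    cudaMemset(tail, 0, sizeof(int));

    int threads = 256, blocks = (n + threads - 1) / threads;
    append_kernel<<<blocks, threads>>>(n, tail, buffer);

    // cudaMemcpy waits for the kernel on the default stream to finish.
    int count = 0;
    cudaMemcpy(&count, tail, sizeof(int), cudaMemcpyDeviceToHost);
    if (count > BUFFER_CAPACITY) count = BUFFER_CAPACITY;

    std::vector<float> host(count);
    cudaMemcpy(host.data(), buffer, count * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(tail);
    cudaFree(buffer);
    return host;
}
```

The values land in the buffer in whatever order the threads reach the atomicAdd, so the output order is not deterministic from run to run; only the set of values is.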
Also, I would like to know whether the GPU has some form of in-order commit guarantee. That is, say I do a simple write to position A at clock cycle x; will a read of the same position A at cycle x+1 be deterministic, i.e. will it return the value written in the previous clock cycle? In essence, my question is whether there can be race conditions even when instructions are issued at different clock cycles, or whether there is some kind of reorder-buffer structure that prevents such RAW hazards. I guess this is not very useful for the programmer to know, since blocks don't execute in any particular order on the SMs, but I am just curious.
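To illustrate the kind of read-after-write case I mean, here is a small sketch. As far as I understand the CUDA programming model, a thread always sees its own earlier writes, but a read of a location written by a different thread is only well defined after an explicit barrier or fence, regardless of which clock cycle the instructions were issued on. The sketch uses __syncthreads(), which only covers threads of the same block; neighbour_read and run_neighbour_read are hypothetical names.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK = 256;

// Each thread writes its own slot, then reads the slot of its neighbour.
__global__ void neighbour_read(int *out) {
    __shared__ int slots[BLOCK];
    int t = threadIdx.x;
    slots[t] = t * t;  // write by this thread

    // Without this barrier, the read below could race with the write made
    // by the neighbouring thread; issue order alone does not protect it.
    __syncthreads();

    // Read a value that another thread of the same block wrote.
    out[blockIdx.x * BLOCK + t] = slots[(t + 1) % BLOCK];
}

// Launches the kernel on the given number of blocks and returns the result.
std::vector<int> run_neighbour_read(int blocks) {
    int n = blocks * BLOCK;
    int *out = nullptr;
    cudaMalloc(&out, n * sizeof(int));
    neighbour_read<<<blocks, BLOCK>>>(out);

    std::vector<int> host(n);
    cudaMemcpy(host.data(), out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(out);
    return host;
}
```

Threads in different blocks cannot use __syncthreads() this way, which is why the append case above goes through an atomic instead.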
P.S. I did try searching the forum, but all it said was that this is possible using global atomics; no implementation was discussed.