Each thread calculates a value. I then need to find the highest value across all the threads, and if a particular thread
holds that value I need to store its index. Yep, reduction, I hear you say, but a reduction leaves only one thread holding the answer.
I need that answer to then be available to all the threads, which seems to be the problem.
e.g. (pseudocode here to show my point, without much reduction!)
if(i>sharedBest) atomicMax(&sharedBest,i); // Find the best in this block
__syncthreads(); // Wait for block to sync
if(indexWithinBlock==0) // Are we the first thread in the block?
while(count>0); // Wait for other blocks to reach this point, syncthreads won’t work here
What happens is that I get different values for res->best_z from run to run, because not all of the atomicMax(&res->best,sharedBest) calls have completed when the value is read, even with my while(count>0).
As __syncthreads only syncs the threads within a block, and the documentation says there is no mechanism for global synchronization, I am stuck.
This must be a common enough scenario. Any suggestions?
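
For reference, here is a minimal sketch of one approach that may fit: split the work into two kernel launches, so that the boundary between the launches acts as the grid-wide barrier that __syncthreads cannot provide. This is only an illustration under assumptions, not a definitive solution. The Result struct and its bestIndex field, the kernel names findBest and useBest, the values array (standing in for whatever each thread calculates) and the test data are all hypothetical.

```cuda
#include <cstdio>
#include <climits>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical result record; field names are assumptions for this sketch.
struct Result {
    int best;       // highest value found across the whole grid
    int bestIndex;  // global index of a thread that holds that value
};

// Pass 1: find the best value per block in shared memory,
// then issue a single global atomicMax per block.
__global__ void findBest(const int *values, int n, Result *res)
{
    __shared__ int sharedBest;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (threadIdx.x == 0) sharedBest = INT_MIN;
    __syncthreads();                      // sharedBest is initialised

    if (i < n) atomicMax(&sharedBest, values[i]);
    __syncthreads();                      // block-wide best is final

    if (threadIdx.x == 0) atomicMax(&res->best, sharedBest);
}

// Pass 2: runs only after pass 1 has finished (same stream),
// so every thread in every block reads the final res->best.
__global__ void useBest(const int *values, int n, Result *res)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && values[i] == res->best)
        atomicMin(&res->bestIndex, i);    // ties resolve to the lowest index
}

int main()
{
    const int n = 4096, threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Made-up input standing in for the per-thread calculation.
    std::vector<int> host(n);
    for (int i = 0; i < n; ++i) host[i] = (i * 37) % 1000;

    int *dValues = nullptr;
    Result *dRes = nullptr;
    cudaMalloc(&dValues, n * sizeof(int));
    cudaMalloc(&dRes, sizeof(Result));
    cudaMemcpy(dValues, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    Result init = { INT_MIN, INT_MAX };
    cudaMemcpy(dRes, &init, sizeof(Result), cudaMemcpyHostToDevice);

    findBest<<<blocks, threads>>>(dValues, n, dRes);
    useBest<<<blocks, threads>>>(dValues, n, dRes);   // ordered after findBest

    Result out;
    cudaMemcpy(&out, dRes, sizeof(Result), cudaMemcpyDeviceToHost);
    printf("best = %d at index %d\n", out.best, out.bestIndex);

    cudaFree(dValues);
    cudaFree(dRes);
    return 0;
}
```

Kernels launched into the same stream execute in order, so the second kernel does not start until every block of the first has finished its atomicMax. That removes the need for the while(count>0) spin, which can hang or read stale data because there is no guarantee that all blocks are resident on the device at the same time.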