Reduction for a maximum value for all threads?

Problem:
Each thread calculates a value. I then need to find the highest value across all the threads, and if a particular thread
holds that value I need to store its index. Yes, reduction, I hear you say, but a reduction leaves one thread holding the answer.
I need that answer to be available to all the threads, which seems to be a problem.

e.g. (Pseudo code to show my point, without much reduction!)
__shared__ int sharedBest;

if(threadIndex==0)
{
    res->best=0;
    count=blockCount;
}

if(indexWithinBlock==0) sharedBest=0;

int i = /* value calculated by this thread */;

if(i>sharedBest) atomicMax(&sharedBest,i); // Find the best in this block

__syncthreads(); // Wait for block to sync

if(indexWithinBlock==0) // Are we the first thread in the block?
{
    if(sharedBest>res->best) atomicMax(&res->best,sharedBest);
    atomicSub(&count,1);
}

while(count>0); // Wait for other blocks to reach this point, syncthreads won’t work here

if(res->best==i) res->best_z=threadIndex;

What happens is that I get different values for res->best_z, because not all of the atomicMax(&res->best,sharedBest) calls have completed by the time the final comparison runs, even with my while(count>0) loop.
As __syncthreads only syncs a block, and the documentation says there is no mechanism for global synchronization, I am stuck.

This must be a common enough scenario, any suggestions??

Simon

The only way to globally synchronize is to launch a new kernel.

I did think of that. I would have to store the values generated by the threads in global memory and then launch another kernel to find the highest value. It may be quicker just to write the calculated values back to the host…
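To make the two-launch idea concrete, here is a minimal host-side C++ sketch of its structure rather than real CUDA: each function stands in for one kernel launch, and the vectors stand in for global memory. The names Result, blockMaxima and findBest are made up for illustration, and taking the lowest index on a tie is an assumption, since the original code does not say which thread should win.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result record, mirroring res->best and res->best_z from the question.
struct Result {
    int best;
    int best_z;
};

// Pass 1 (stand-in for the first kernel): every "block" of blockSize values is
// reduced to one partial maximum, stored in a buffer that plays the role of global memory.
std::vector<int> blockMaxima(const std::vector<int>& values, std::size_t blockSize) {
    assert(blockSize > 0);
    std::vector<int> partial;
    for (std::size_t start = 0; start < values.size(); start += blockSize) {
        const std::size_t end = std::min(start + blockSize, values.size());
        const int* first = values.data() + start;
        const int* last = values.data() + end;
        partial.push_back(*std::max_element(first, last));
    }
    return partial;
}

// Pass 2 (stand-in for the second kernel): it only starts once pass 1 has
// finished, so every partial maximum is complete when it is read here.
Result findBest(const std::vector<int>& values, const std::vector<int>& partial) {
    Result res{0, -1};  // best starts at 0, as in the question's pseudo code
    if (partial.empty()) return res;
    res.best = *std::max_element(partial.begin(), partial.end());
    // Assumption: on a tie, the lowest index is recorded.
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (values[i] == res.best) {
            res.best_z = static_cast<int>(i);
            break;
        }
    }
    return res;
}
```

On the GPU, the return from the first function corresponds to the kernel boundary, which is the point where every block's partial result is known to be complete. After the second launch, each thread could compare its own value with res->best the way the last line of the question's code does.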

Use Thrust for the reduction; with a counting iterator you can get both the index and the value:

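#include <cfloat>
#include <thrust/device_ptr.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/reduce.h>
#include <thrust/tuple.h>

// Assuming you have a float* GPU pointer named d_ptr and your data holds N elements
thrust::device_ptr<float> t_ptr = thrust::device_pointer_cast(d_ptr);
thrust::tuple<float,int> t = thrust::reduce(
  thrust::make_zip_iterator(thrust::make_tuple(t_ptr, thrust::make_counting_iterator<int>(0))),
  thrust::make_zip_iterator(thrust::make_tuple(t_ptr + N, thrust::make_counting_iterator<int>(N))),
  thrust::make_tuple(-FLT_MAX, -1), // initial value; with a custom operator, reduce takes (first, last, init, op)
  thrust::maximum<thrust::tuple<float, int> >()
);
// thrust::get<0>(t) is the maximum value, thrust::get<1>(t) is its index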

I am not quite sure whether tuple supports operator<, so if it doesn’t work you should create a simple functor:

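struct max_first {
  __device__ __host__ thrust::tuple<float, int> operator()(thrust::tuple<float, int> a, thrust::tuple<float, int> b) const
  {
    // Keep whichever (value, index) pair has the larger value
    if (thrust::get<0>(a) > thrust::get<0>(b)) return a;
    return b;
  }
};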
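
For anyone unsure what that functor does inside the reduction, here is a minimal CPU-only sketch of the same (value, index) combine step, using std::tuple in place of thrust::tuple so it compiles without the CUDA toolkit. max_first_host and argmax are hypothetical names, and the sequential loop is only a stand-in for whatever order thrust::reduce might combine elements in.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

using ValueIndex = std::tuple<float, int>;  // (value, index of that value)

// Host-only counterpart of max_first: keep the pair with the larger value.
// On equal values the second argument is returned, exactly as in max_first.
struct max_first_host {
    ValueIndex operator()(const ValueIndex& a, const ValueIndex& b) const {
        if (std::get<0>(a) > std::get<0>(b)) return a;
        return b;
    }
};

// Hypothetical helper: folds the data left to right, pairing each value
// with its position the way the zip/counting iterators do on the device.
ValueIndex argmax(const std::vector<float>& data) {
    assert(!data.empty());
    max_first_host op;
    ValueIndex acc{data[0], 0};
    for (std::size_t i = 1; i < data.size(); ++i) {
        acc = op(acc, ValueIndex{data[i], static_cast<int>(i)});
    }
    return acc;
}
```

With this left-to-right loop, argmax({1.5f, 4.0f, 2.0f}) returns (4.0, 1). On equal values the functor returns its second argument, so which index wins a tie depends on the order of combination, and a parallel reduction does not promise a particular order.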