Problem:
Each thread calculates a value. I then need to find the highest value across all the threads, and if a particular thread
holds that value I then need to store its index. Yep, reduction, I hear, but reduction results in one thread having the answer.
I need that answer to then be available to all the threads, which seems to be a problem.
e.g. (Pseudo code here to show my point (without much reduction!))
__shared__ sharedBest;
What happens is I get different values for res->best_z because not all of the atomicMax(&res->best,sharedBest) calls have completed, even with my while(count>0).
As __syncthreads only syncs a block, and the documentation says there is no mechanism for global synchronization, I am stuck.
This must be a common enough scenario, any suggestions?
Use Thrust for the reduction; with a counting iterator you can get both the index and the value:
// assuming you have a float* GPU pointer named d_ptr and your data has N elements
thrust::device_ptr<float> t_ptr = thrust::device_pointer_cast (d_ptr);
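// thrust::reduce needs an initial value before the binary op (FLT_MAX comes from <cfloat>)
thrust::tuple<float,int> t = thrust::reduce (
thrust::make_zip_iterator(thrust::make_tuple(t_ptr, thrust::make_counting_iterator<int>(0))),
thrust::make_zip_iterator(thrust::make_tuple(t_ptr + N, thrust::make_counting_iterator<int>(N))),
thrust::make_tuple(-FLT_MAX, -1),
thrust::maximum<thrust::tuple<float, int> >()
);
// thrust::get<0>(t) is the maximum value, thrust::get<1>(t) is its index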
I am not quite sure whether tuple supports operator<, so if it doesn't work you should create a simple functor:
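struct max_first {
// returns whichever (value, index) pair has the larger value
__device__ __host__ thrust::tuple<float, int> operator()(thrust::tuple<float, int> a, thrust::tuple<float, int> b) const
{
if (thrust::get<0>(a) > thrust::get<0>(b)) return a;
return b;
}
};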
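Putting the pieces together, here is a minimal self-contained sketch of the same idea. The names ValueIndex, find_best and use_best are made up for illustration, and it assumes the data lives in a thrust::device_vector<float> rather than behind a raw pointer. The functor also breaks ties towards the lower index, which is my own choice and not something the original post asks for.

```cuda
#include <cfloat>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/reduce.h>
#include <thrust/tuple.h>

typedef thrust::tuple<float, int> ValueIndex;  // (value, index) pair

// Keeps the pair with the larger value; on a tie the lower index wins
// (tie-breaking rule is an assumption, chosen to make the result deterministic).
struct max_first
{
    __host__ __device__
    ValueIndex operator()(const ValueIndex& a, const ValueIndex& b) const
    {
        if (thrust::get<0>(a) > thrust::get<0>(b)) return a;
        if (thrust::get<0>(a) < thrust::get<0>(b)) return b;
        return thrust::get<1>(a) < thrust::get<1>(b) ? a : b;
    }
};

// Hypothetical helper: returns (max value, index of max) for the vector v.
// The initial value (-FLT_MAX, -1) is what comes back for an empty vector.
ValueIndex find_best(const thrust::device_vector<float>& v)
{
    const int n = static_cast<int>(v.size());
    thrust::counting_iterator<int> idx(0);
    return thrust::reduce(
        thrust::make_zip_iterator(thrust::make_tuple(v.begin(), idx)),
        thrust::make_zip_iterator(thrust::make_tuple(v.end(), idx + n)),
        ValueIndex(-FLT_MAX, -1),
        max_first());
}

// Hypothetical follow-up kernel: the result arrives as ordinary kernel
// arguments, so every thread in every block sees the same best value and index.
__global__ void use_best(const float* values, int n, float best, int best_index, int* is_best)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) is_best[i] = (i == best_index && values[i] == best) ? 1 : 0;
}
```

The reduction hands the result back to the host, and a kernel launched afterwards (such as use_best above) can take thrust::get<0>(r) and thrust::get<1>(r) as arguments. Splitting the work at the kernel boundary like this is the usual way to get the grid-wide synchronization that __syncthreads cannot provide.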