Reduction for a maximum value for all threads?

StanfordYell · July 28, 2011, 7:35pm

Problem:
Each thread calculates a value. I then need to find the highest value across all the threads, and if a particular thread
holds that value I need to store its index. Yes, that is a reduction, I hear you say, but a reduction leaves only one thread holding the answer.
I need that answer to then be available to all the threads, which seems to be the problem.

e.g. (Pseudocode here to show my point, without much reduction!)
__shared__ int sharedBest;

if(threadIndex==0)
{
    res->best=0;
    count=blockCount;
}

if(indexWithinBlock==0) sharedBest=0;

int i = /* value calculated by this thread */;

if(i>sharedBest) atomicMax(&sharedBest,i); // Find the best in this block

__syncthreads(); // Wait for block to sync

if(indexWithinBlock==0) // Are we the first thread in the block?
{
    if(sharedBest>res->best) atomicMax(&res->best,sharedBest);
    atomicSub(&count,1);
}

while(count>0); // Wait for other blocks to reach this point, syncthreads won’t work here

if(res->best==i) res->best_z=threadIndex;

What happens is that I get different values for res->best_z, because not all of the atomicMax(&res->best,sharedBest) calls have completed, even with my while(count>0).
As __syncthreads() only syncs a block, and the documentation says there is no mechanism for global synchronization, I am stuck.

This must be a common enough scenario, any suggestions??

Simon

tera · July 28, 2011, 8:04pm

The only way to globally synchronize is to launch a new kernel.

StanfordYell · July 28, 2011, 11:41pm

I did think of that. I would have to store the values generated in the threads in global memory and then launch another kernel to find the highest value. Maybe quicker to just write the calculated values back to the host…
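The two-kernel idea discussed above could look roughly like the sketch below. It is only an illustration under stated assumptions: the kernel names, the helper `findBestIndex`, and the placeholder calculation `(tid * 37) % 101` are all hypothetical and not from the original code. The first kernel stores each thread's value in global memory and folds it into a global maximum; the second kernel runs only after the first has finished, so every thread can safely read the final maximum and record its index. For brevity it uses a global `atomicMax` directly rather than a per-block shared-memory step.

```cuda
#include <cuda_runtime.h>
#include <climits>

// Kernel 1: each thread computes its value, saves it, and updates the global maximum.
__global__ void computeAndMax(int *values, int *best, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    int v = (tid * 37) % 101;          // placeholder for the real per-thread calculation (assumption)
    values[tid] = v;
    atomicMax(best, v);
}

// Kernel 2: launched after kernel 1 has completed, so *best is final for every thread.
__global__ void recordIndex(const int *values, const int *best, int *bestIndex, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n && values[tid] == *best)
        atomicMin(bestIndex, tid);     // on ties, keep the lowest thread index
}

// Hypothetical host helper: returns the winning index and writes the maximum to *bestOut.
int findBestIndex(int n, int *bestOut)
{
    int *dValues, *dBest, *dIndex;
    cudaMalloc(&dValues, n * sizeof(int));
    cudaMalloc(&dBest, sizeof(int));
    cudaMalloc(&dIndex, sizeof(int));

    int initBest = INT_MIN, initIndex = INT_MAX;
    cudaMemcpy(dBest, &initBest, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dIndex, &initIndex, sizeof(int), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    // Kernels in the same stream run in order; the kernel boundary is the global sync.
    computeAndMax<<<blocks, threads>>>(dValues, dBest, n);
    recordIndex<<<blocks, threads>>>(dValues, dBest, dIndex, n);

    int index = -1;
    cudaMemcpy(bestOut, dBest, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&index, dIndex, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dValues);
    cudaFree(dBest);
    cudaFree(dIndex);
    return index;
}
```

Calling `findBestIndex(n, &best)` from host code returns the index and fills in `best`. Error checking on the CUDA calls is left out to keep the sketch short.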

devkec · August 1, 2011, 12:31am

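Use Thrust for the reduction; with a counting iterator you can get both the index and the value:

#include <cfloat>
#include <thrust/device_ptr.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <thrust/tuple.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>

// Assuming you have a float* GPU pointer named d_ptr and your data has N elements
thrust::device_ptr<float> t_ptr = thrust::device_pointer_cast(d_ptr);
thrust::tuple<float,int> t = thrust::reduce(
  thrust::make_zip_iterator(thrust::make_tuple(t_ptr, thrust::make_counting_iterator<int>(0))),
  thrust::make_zip_iterator(thrust::make_tuple(t_ptr + N, thrust::make_counting_iterator<int>(N))),
  thrust::make_tuple(-FLT_MAX, -1), // initial value: thrust::reduce requires one when an operator is passed
  thrust::maximum<thrust::tuple<float, int> >()
);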

I am not quite sure whether tuple supports operator<, so if it doesn’t work you should create a simple functor:

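struct max_first {
  // Returns whichever tuple has the larger first element (the value)
  __device__ __host__
  thrust::tuple<float, int> operator()(thrust::tuple<float, int> a, thrust::tuple<float, int> b)
  {
    if (thrust::get<0>(a) > thrust::get<0>(b)) return a;
    return b;
  }
};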
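Putting the pieces together, a self-contained version of this approach might look like the sketch below. The names `ValueIndex`, `MaxFirst`, and `argmax` are hypothetical, and taking a `thrust::device_vector<float>` instead of a raw pointer is an assumption made to keep the example short. The functor here also breaks ties by taking the lower index, so the result does not depend on the order in which the reduction combines elements.

```cuda
#include <cfloat>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/tuple.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>

typedef thrust::tuple<float, int> ValueIndex;   // (value, index) pair

// Keeps the pair with the larger value; on equal values, the lower index wins.
struct MaxFirst {
    __host__ __device__
    ValueIndex operator()(const ValueIndex &a, const ValueIndex &b) const
    {
        if (thrust::get<0>(a) > thrust::get<0>(b)) return a;
        if (thrust::get<0>(b) > thrust::get<0>(a)) return b;
        return thrust::get<1>(a) < thrust::get<1>(b) ? a : b;
    }
};

// Hypothetical helper: returns (maximum value, its index) for data already on the device.
ValueIndex argmax(const thrust::device_vector<float> &d)
{
    int n = static_cast<int>(d.size());
    thrust::counting_iterator<int> idx(0);
    return thrust::reduce(
        thrust::make_zip_iterator(thrust::make_tuple(d.begin(), idx)),
        thrust::make_zip_iterator(thrust::make_tuple(d.end(), idx + n)),
        thrust::make_tuple(-FLT_MAX, -1),   // initial value; index -1 means "nothing found"
        MaxFirst());
}
```

The value and index can then be read with `thrust::get<0>(result)` and `thrust::get<1>(result)`. If only the position of the maximum is needed, `thrust::max_element` from `<thrust/extrema.h>` is another option worth considering.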
