Reduction in kernel function: get max value in vector

Hi, all! Please answer the following question: how can I get the max value of a vector (one-dimensional array) using parallel reduction?
I found a kernel function that computes the sum of a vector, but I can't understand how to get the max value.
I know about the Thrust library and the function thrust::max_element, BUT as far as I can tell the Thrust function works with an array allocated on the host.
In my case the vector is in global memory on the GPU. Any suggestions?

Sorry for my bad English! :)
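A note on the Thrust point above: thrust::max_element is not restricted to host arrays. A raw pointer to global memory can be wrapped in a thrust::device_ptr, and the algorithm then runs on the GPU. Below is a minimal sketch under assumed conditions: the buffer name d_vec, its size, and the test data are made up for illustration, and error checking is left out.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/extrema.h>

int main()
{
    const int n = 1000;                       // illustrative size
    float h[n];
    for (int i = 0; i < n; ++i) h[i] = float(i % 100);
    h[123] = 500.0f;                          // planted maximum

    // d_vec stands in for a vector that already lives in global memory.
    float *d_vec = 0;
    cudaMalloc(&d_vec, n * sizeof(float));
    cudaMemcpy(d_vec, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // Wrap the raw device pointer so Thrust dispatches to the GPU.
    thrust::device_ptr<float> first = thrust::device_pointer_cast(d_vec);
    thrust::device_ptr<float> it = thrust::max_element(first, first + n);

    float max_value = *it;                    // copies a single float back to the host
    int max_index = int(it - first);          // position of the maximum

    printf("max = %f at index %d\n", max_value, max_index);

    cudaFree(d_vec);
    return 0;
}
```

Only one float crosses the bus when the iterator is dereferenced; the vector itself stays on the device.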

Look at the reduction example from the SDK and replace addition with the max() function.

__device__ void warp_reduce_max(volatile float* smem)
{
	// Max reduction over smem[0..63], leaving the result in smem[0].
	// Call it from threads 0..31 only. It relies on the threads of one warp
	// running in lockstep, hence volatile and no __syncthreads().
	smem[threadIdx.x] = smem[threadIdx.x+32] > smem[threadIdx.x] ?
						smem[threadIdx.x+32] : smem[threadIdx.x];
	smem[threadIdx.x] = smem[threadIdx.x+16] > smem[threadIdx.x] ?
						smem[threadIdx.x+16] : smem[threadIdx.x];
	smem[threadIdx.x] = smem[threadIdx.x+8] > smem[threadIdx.x] ?
						smem[threadIdx.x+8] : smem[threadIdx.x];
	smem[threadIdx.x] = smem[threadIdx.x+4] > smem[threadIdx.x] ?
						smem[threadIdx.x+4] : smem[threadIdx.x];
	smem[threadIdx.x] = smem[threadIdx.x+2] > smem[threadIdx.x] ?
						smem[threadIdx.x+2] : smem[threadIdx.x];
	smem[threadIdx.x] = smem[threadIdx.x+1] > smem[threadIdx.x] ?
						smem[threadIdx.x+1] : smem[threadIdx.x];
}

This works on a per-warp basis, of course, so you would need something like:

if(threadIdx.x < 32)
	warp_reduce_max(smem);
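To show where that guard sits in a complete kernel, here is a rough sketch. It assumes a block of exactly 64 threads and a 64-element shared array; the names block_max_64 and warp_reduce_max_sync are hypothetical. It is also not the code above verbatim: the lockstep assumption is replaced by explicit __syncwarp() calls (available since CUDA 9), because newer GPUs do not guarantee that the threads of a warp execute in lockstep.

```cuda
#include <cfloat>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical variant of warp_reduce_max: same access pattern,
// but every step is fenced with __syncwarp() instead of relying on lockstep.
__device__ void warp_reduce_max_sync(float *smem)
{
    for (int offset = 32; offset > 0; offset >>= 1) {
        float other = smem[threadIdx.x + offset];   // highest index read: 31 + 32 = 63
        __syncwarp();                               // all reads done before any write
        smem[threadIdx.x] = fmaxf(smem[threadIdx.x], other);
        __syncwarp();                               // writes visible before the next step
    }
}

// Assumes blockDim.x == 64. Each block writes its maximum to out[blockIdx.x].
__global__ void block_max_64(const float *in, float *out, int n)
{
    __shared__ float smem[64];
    int gid = blockIdx.x * 64 + threadIdx.x;
    smem[threadIdx.x] = (gid < n) ? in[gid] : -FLT_MAX;  // pad with the identity for max
    __syncthreads();                                      // both warps have filled smem

    if (threadIdx.x < 32)
        warp_reduce_max_sync(smem);

    if (threadIdx.x == 0)
        out[blockIdx.x] = smem[0];
}

int main()
{
    const int n = 1000, threads = 64;
    const int blocks = (n + threads - 1) / threads;

    std::vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = float(i % 37);
    h[500] = 99.0f;                                       // planted maximum

    float *d_in = 0, *d_out = 0;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_max_64<<<blocks, threads>>>(d_in, d_out, n);

    // Finish on the host: reduce the per-block maxima.
    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    float result = partial[0];
    for (int b = 1; b < blocks; ++b)
        if (partial[b] > result) result = partial[b];

    printf("max = %f\n", result);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The __syncthreads() before the guard matters: warp 0 reads elements 32..63, which were written by warp 1.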

Jimmy Pettersson, thanks a lot! I will try this example code and post the result in this topic.

Jimmy Pettersson, can you help me with my program? I took your example for finding the min/max value, my_min_max_w_indeces.cu, changed it slightly for my needs, and compiled it. On its own, the example works fine! But when I include it in my main program, your code works only for size <= 64… I have been searching for the error for a long time, but I can't find where it comes from.

I have attached my main project. Could you download it and run it? Maybe you can help me find the problem in this code. IDE: VS 2010, CUDA SDK 3.2.

To repeat my main question: why does the code for finding the max value work fine with a vector of size 64*64 or smaller, but not with larger dimensions?
Cudasson.zip (700 KB)

Another strange artifact:

In the function find_min_max_dynamic I want to save the max value in the global device variable MAX_VALUE:

__device__  float MAX_VALUE;
...
// this code works fine when vector size <= 64*64
if(threadIdx.x == 0) {
		out[blockIdx.x + gridDim.x] = smem_max[threadIdx.x]; //return max
		MAX_VALUE = out[2]; // why is the max value located in out[2], not in out[1]?

If I instead write

if(threadIdx.x == 0) {
		out[blockIdx.x + gridDim.x] = smem_max[threadIdx.x]; //return max
		MAX_VALUE = out[blockIdx.x + gridDim.x]; // why is the max value located in out[2], not in out[1]?

then I never get the max value.
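A possible explanation, which cannot be confirmed without the full kernel: with the indexing out[blockIdx.x + gridDim.x], the per-block maxima start at out[gridDim.x], so out[2] would be block 0's maximum if the grid has two blocks. And when every block's thread 0 assigns to MAX_VALUE, whichever block writes last wins, so the variable ends up holding one block's maximum rather than the overall one. One way around this, sketched below as an assumption rather than a fix for the attached project, is to combine the per-block results with an atomic max built on atomicCAS. The names atomic_max_float and max_to_global are hypothetical, and the block size of 256 is chosen only for illustration.

```cuda
#include <cfloat>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__device__ float MAX_VALUE;

// Hypothetical helper: atomic max for float, built from atomicCAS on the bit pattern.
__device__ void atomic_max_float(float *addr, float value)
{
    int *addr_as_int = reinterpret_cast<int *>(addr);
    int old = *addr_as_int;
    while (value > __int_as_float(old)) {       // stop once the stored value is >= ours
        int assumed = old;
        old = atomicCAS(addr_as_int, assumed, __float_as_int(value));
        if (old == assumed) break;              // our value was stored
    }
}

// Assumes blockDim.x == 256 (a power of two).
__global__ void max_to_global(const float *in, int n)
{
    __shared__ float smem_max[256];
    int tid = threadIdx.x;

    float hi = -FLT_MAX;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += blockDim.x * gridDim.x)
        hi = fmaxf(hi, in[i]);
    smem_max[tid] = hi;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            smem_max[tid] = fmaxf(smem_max[tid], smem_max[tid + s]);
        __syncthreads();
    }

    // Every block contributes its own maximum; the atomic keeps the largest.
    if (tid == 0)
        atomic_max_float(&MAX_VALUE, smem_max[0]);
}

int main()
{
    const int n = 128 * 128;                    // larger than 64*64 on purpose
    std::vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = float(i % 7) - 10.0f;   // all negative
    h[9000] = 7.5f;                             // planted maximum

    float *d_in = 0;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMemcpy(d_in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    float init = -FLT_MAX;                      // MAX_VALUE must be reset before each run
    cudaMemcpyToSymbol(MAX_VALUE, &init, sizeof(float));

    max_to_global<<<16, 256>>>(d_in, n);

    float result = 0.0f;
    cudaMemcpyFromSymbol(&result, MAX_VALUE, sizeof(float));
    printf("max = %f\n", result);

    cudaFree(d_in);
    return 0;
}
```

The host reads the variable back with cudaMemcpyFromSymbol, which also waits for the kernel to finish.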

I found an example of a very fast reduction kernel:

template <unsigned int blockSize>
__device__ void warpReduce(volatile int *sdata, unsigned int tid) {
	if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
	if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
	if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
	if (blockSize >= 8) sdata[tid] += sdata[tid + 4];
	if (blockSize >= 4) sdata[tid] += sdata[tid + 2];
	if (blockSize >= 2) sdata[tid] += sdata[tid + 1];
}

template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) {
	extern __shared__ int sdata[];
	unsigned int tid = threadIdx.x;
	unsigned int i = blockIdx.x*(blockSize*2) + tid;
	unsigned int gridSize = blockSize*2*gridDim.x;
	sdata[tid] = 0;
	while (i < n) { sdata[tid] += g_idata[i] + g_idata[i+blockSize]; i += gridSize; }
	__syncthreads();
	if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
	if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
	if (blockSize >= 128) { if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); }
	if (tid < 32) warpReduce<blockSize>(sdata, tid);
	if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

Can anybody tell me what I must change in this code to get the max value instead of the sum?
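Three things change when a sum reduction becomes a max reduction: every += becomes a max, the initial value 0 becomes the lowest representable value (otherwise an all-negative vector returns 0), and each load needs its own bounds check. The sketch below is a simplified take on that idea, not a line-by-line port of reduce6: it uses float data, a plain loop with __syncthreads() at every step instead of the unrolled warp stage, and a hypothetical kernel name reduce_max. blockSize is assumed to be a power of two.

```cuda
#include <cfloat>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each block writes one partial maximum to g_odata[blockIdx.x].
template <unsigned int blockSize>
__global__ void reduce_max(const float *g_idata, float *g_odata, unsigned int n)
{
    __shared__ float sdata[blockSize];
    unsigned int tid = threadIdx.x;

    // The identity for max is the lowest float, not 0.
    float best = -FLT_MAX;

    // Grid-stride loop; every load is bounds-checked by the loop condition.
    for (unsigned int i = blockIdx.x * blockSize + tid; i < n; i += blockSize * gridDim.x)
        best = fmaxf(best, g_idata[i]);

    sdata[tid] = best;
    __syncthreads();

    // Tree reduction in shared memory, with max in place of +=.
    for (unsigned int s = blockSize / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] = fmaxf(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

int main()
{
    const unsigned int n = 100000;              // deliberately not a multiple of the block size
    const unsigned int threads = 256, blocks = 64;

    std::vector<float> h(n);
    for (unsigned int i = 0; i < n; ++i) h[i] = -1.0f - float(i % 1000);
    h[77777] = 42.0f;                           // planted maximum

    float *d_in = 0, *d_partial = 0;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Pass 1: one partial maximum per block.
    reduce_max<threads><<<blocks, threads>>>(d_in, d_partial, n);
    // Pass 2: a single block reduces the partial maxima; the result lands in d_partial[0].
    reduce_max<threads><<<1, threads>>>(d_partial, d_partial, blocks);

    float result = 0.0f;
    cudaMemcpy(&result, d_partial, sizeof(float), cudaMemcpyDeviceToHost);
    printf("max = %f\n", result);

    cudaFree(d_in);
    cudaFree(d_partial);
    return 0;
}
```

The second launch can safely reuse d_partial as input and output because it runs a single block, and thread 0 writes only after all reads have passed the first __syncthreads().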

As you can see, the number 64 and const int threads = 64; appear in that code snippet. To make it work for larger sizes, you would have to increase the shared memory usage as well as modify the 'threads' parameter.
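To illustrate how the shared memory size and the 'threads' parameter can be kept in step, here is a sketch that sizes the shared arrays at launch time instead of hard-coding 64. It is an assumption about the general shape of such a kernel, not the contents of my_min_max_w_indeces.cu: the kernel name block_min_max is hypothetical, and the output layout (minima first, maxima from out[gridDim.x] on) is only inferred from the snippet earlier in the thread.

```cuda
#include <cfloat>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: out[blockIdx.x] receives the block's min,
// out[blockIdx.x + gridDim.x] its max. blockDim.x must be a power of two.
__global__ void block_min_max(const float *in, float *out, int n)
{
    extern __shared__ float smem[];             // 2 * blockDim.x floats, sized at launch
    float *smem_min = smem;
    float *smem_max = smem + blockDim.x;

    int tid = threadIdx.x;
    float lo = FLT_MAX, hi = -FLT_MAX;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += blockDim.x * gridDim.x) {
        lo = fminf(lo, in[i]);
        hi = fmaxf(hi, in[i]);
    }
    smem_min[tid] = lo;
    smem_max[tid] = hi;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            smem_min[tid] = fminf(smem_min[tid], smem_min[tid + s]);
            smem_max[tid] = fmaxf(smem_max[tid], smem_max[tid + s]);
        }
        __syncthreads();
    }

    if (tid == 0) {
        out[blockIdx.x] = smem_min[0];
        out[blockIdx.x + gridDim.x] = smem_max[0];
    }
}

int main()
{
    const int n = 64 * 64 * 4;                  // larger than 64*64 on purpose
    const int threads = 256;                    // change freely, as long as it is a power of two
    const int blocks = 32;
    const size_t smem_bytes = 2 * threads * sizeof(float);   // always derived from threads

    std::vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = float(i % 50);
    h[10] = -3.0f;                              // planted minimum
    h[12345] = 123.0f;                          // planted maximum

    float *d_in = 0, *d_out = 0;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, 2 * blocks * sizeof(float));
    cudaMemcpy(d_in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_min_max<<<blocks, threads, smem_bytes>>>(d_in, d_out, n);

    // Finish on the host: combine the per-block results.
    std::vector<float> out(2 * blocks);
    cudaMemcpy(out.data(), d_out, 2 * blocks * sizeof(float), cudaMemcpyDeviceToHost);
    float lo = out[0], hi = out[blocks];
    for (int b = 1; b < blocks; ++b) {
        if (out[b] < lo) lo = out[b];
        if (out[blocks + b] > hi) hi = out[blocks + b];
    }
    printf("min = %f, max = %f\n", lo, hi);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Because the kernel reads its sizes from blockDim.x and the third launch parameter, changing threads in one place keeps the indexing and the allocation consistent.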

constant_, I tried this, but then the function find_min_max_dynamic returns a wrong value = 0.00000.
When I run the standalone example for getting the max value, it works fine for all array sizes without changing const int threads = 64 (with 64 or with 1024, everything works fine), but when I include the example in my program, it does not work! And I cannot find the bug :(