Reduction in kernel function: get max value in vector

sanek_dampir · May 2, 2011, 11:26am

Hi all! Please help me with the following question: how can I get the max value of a vector (one-dimensional array) using parallel reduction?
I found a kernel function that computes the sum of a vector, but I can't work out how to get the max value instead.
I know about the Thrust library and the function thrust::max_element, BUT as far as I understand, the Thrust function works with arrays allocated on the host.
In my case the vector is in global memory on the GPU. Any suggestions?

Sorry for my bad English! :)
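
For reference, thrust::max_element is not limited to host arrays: a raw device pointer can be wrapped in a thrust::device_ptr and passed to it. A minimal sketch, assuming a float vector that already lives in global memory; the helper name max_with_thrust is hypothetical and not part of Thrust or of this thread:

```cuda
#include <thrust/device_ptr.h>
#include <thrust/extrema.h>

// Hypothetical helper: returns the largest element of d_data[0..n), n > 0.
// d_data is a raw pointer to global memory (for example from cudaMalloc).
float max_with_thrust(const float* d_data, int n)
{
    // Wrapping the raw pointer tells Thrust the data lives on the device,
    // so the search itself runs on the GPU.
    thrust::device_ptr<const float> first = thrust::device_pointer_cast(d_data);
    thrust::device_ptr<const float> best  = thrust::max_element(first, first + n);

    // Dereferencing a device_ptr copies that single value back to the host.
    return *best;
}
```

The position of the maximum, if needed, would be best - first.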

tera · May 2, 2011, 12:42pm

Look at the reduction example from the SDK and replace addition with the max() function.

Jimmy_Pettersson · May 2, 2011, 2:25pm

__device__ void warp_reduce_max(volatile float* smem)
{
	// Call from threads 0..31 only; smem must hold at least 64 valid elements.
	// Each step folds the upper half of the remaining range into the lower half.
	smem[threadIdx.x] = smem[threadIdx.x+32] > smem[threadIdx.x] ?
						smem[threadIdx.x+32] : smem[threadIdx.x];
	smem[threadIdx.x] = smem[threadIdx.x+16] > smem[threadIdx.x] ?
						smem[threadIdx.x+16] : smem[threadIdx.x];
	smem[threadIdx.x] = smem[threadIdx.x+8] > smem[threadIdx.x] ?
						smem[threadIdx.x+8] : smem[threadIdx.x];
	smem[threadIdx.x] = smem[threadIdx.x+4] > smem[threadIdx.x] ?
						smem[threadIdx.x+4] : smem[threadIdx.x];
	smem[threadIdx.x] = smem[threadIdx.x+2] > smem[threadIdx.x] ?
						smem[threadIdx.x+2] : smem[threadIdx.x];
	smem[threadIdx.x] = smem[threadIdx.x+1] > smem[threadIdx.x] ?
						smem[threadIdx.x+1] : smem[threadIdx.x];
	// The maximum of the 64 inputs is now in smem[0].
}

This is of course on a per-warp basis, so you would need, for example:

if(threadIdx.x < 32)
	warp_reduce_max(smem);
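
A minimal sketch of how such a per-warp step might sit inside a complete block-level kernel. It assumes a block size that is a power of two and at least 64, and a launch that passes blockDim.x * sizeof(float) bytes of dynamic shared memory; the kernel name block_max and its parameters are hypothetical. Note the swapped-in technique: on Volta and later GPUs the implicit warp-synchronous idiom with volatile shared memory is no longer guaranteed to be safe, so this sketch finishes the last warp with __shfl_down_sync (CUDA 9 or newer) instead.

```cuda
#include <cmath>

// Hypothetical kernel: each block writes the max of its slice to out[blockIdx.x].
// Assumes blockDim.x is a power of two >= 64 and dynamic shared memory of
// blockDim.x * sizeof(float) bytes.
__global__ void block_max(const float* in, float* out, int n)
{
    extern __shared__ float smem[];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    // Threads past the end of the input contribute the identity for max.
    smem[tid] = (i < n) ? in[i] : -INFINITY;
    __syncthreads();

    // Tree reduction in shared memory until 64 candidates remain.
    for (int s = blockDim.x / 2; s >= 64; s >>= 1) {
        if (tid < s) smem[tid] = fmaxf(smem[tid], smem[tid + s]);
        __syncthreads();
    }

    if (tid < 32) {
        // Fold the 64 remaining values into one register per lane.
        float v = fmaxf(smem[tid], smem[tid + 32]);
        // Warp shuffles replace the volatile shared-memory steps.
        for (int offset = 16; offset > 0; offset >>= 1)
            v = fmaxf(v, __shfl_down_sync(0xffffffffu, v, offset));
        if (tid == 0) out[blockIdx.x] = v;
    }
}
```

A launch could look like block_max<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_partial, n), followed by a second pass (or a short host loop) over the per-block results in d_partial.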

sanek_dampir · May 3, 2011, 11:55am

Jimmy Pettersson, tera, many thanks!! I will try this example code and then post the result in this topic.

sanek_dampir · May 5, 2011, 9:37am

Jimmy Pettersson, can you help me with my program? I took your example for finding the min/max value and adapted it for my program. I took my_min_max_w_indeces.cu, changed it slightly, and compiled it. And this example works fine! But when I include it in my main program, your code works correctly only for size <= 64… I have been searching for the error for a long time, but I can't find where it comes from.

I have attached my main project; can you download it and run it? Maybe you can help me find the problem in this code. IDE: VS 2010, CUDA SDK 3.2.

To repeat my main question: why does the code for finding the max value work fine with a vector size of 64*64 or smaller, but not with larger dimensions?
Cudasson.zip (700 KB)

sanek_dampir · May 5, 2011, 9:57am

Another strange artifact:

In the function find_min_max_dynamic I want to save the max value in the global device variable MAX_VALUE:

__device__ float MAX_VALUE;
...
// this code works fine when vector size <= 64*64
if(threadIdx.x == 0) {
		out[blockIdx.x + gridDim.x] = smem_max[threadIdx.x]; // return max
		MAX_VALUE = out[2]; // why is the max value located in out[2], not in out[1]?

If I instead write

if(threadIdx.x == 0) {
		out[blockIdx.x + gridDim.x] = smem_max[threadIdx.x]; // return max
		MAX_VALUE = out[blockIdx.x + gridDim.x]; // why is the max value located in out[2], not in out[1]?

then I never get the max value.
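
One possible explanation, offered as an assumption since the full kernel is not shown: thread 0 of every block executes the assignment to MAX_VALUE, and blocks run in no guaranteed order, so the variable ends up holding whichever block's value was written last rather than the overall maximum. A minimal sketch of one way around this, using an atomic max built on atomicCAS and reading the result back with cudaMemcpyFromSymbol. The names atomic_max_float, max_into_global and run_max_into_global are hypothetical.

```cuda
#include <cmath>
#include <cuda_runtime.h>

__device__ float MAX_VALUE;  // same kind of global as in the post

// Hypothetical helper: atomic max for floats built on atomicCAS,
// because CUDA provides atomicMax only for integer types.
__device__ void atomic_max_float(float* addr, float value)
{
    int* addr_as_int = reinterpret_cast<int*>(addr);
    int  old = *addr_as_int;
    // Retry for as long as our value is larger than what is stored.
    while (value > __int_as_float(old)) {
        int assumed = old;
        old = atomicCAS(addr_as_int, assumed, __float_as_int(value));
        if (old == assumed) break;  // the swap succeeded
    }
}

// Hypothetical kernel: every thread folds its element into MAX_VALUE.
// Deliberately simple; reducing per block first and issuing one atomic
// per block would cause far less contention.
__global__ void max_into_global(const float* in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomic_max_float(&MAX_VALUE, in[i]);
}

// Host side: initialise the symbol, launch, then read the result back.
float run_max_into_global(const float* d_in, int n)
{
    float init = -INFINITY;
    cudaMemcpyToSymbol(MAX_VALUE, &init, sizeof(float));

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    max_into_global<<<blocks, threads>>>(d_in, n);

    float result = 0.0f;
    // This copy waits for the kernel above to finish before reading.
    cudaMemcpyFromSymbol(&result, MAX_VALUE, sizeof(float));
    return result;
}
```

The design point is that the final combination across blocks happens through an atomic operation (or a second kernel launch), never through a plain store that several blocks race on.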

sanek_dampir · May 5, 2011, 10:18am

I found an example of a very fast reduction kernel:

template <unsigned int blockSize>
__device__ void warpReduce(volatile int *sdata, unsigned int tid) {
	if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
	if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
	if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
	if (blockSize >= 8) sdata[tid] += sdata[tid + 4];
	if (blockSize >= 4) sdata[tid] += sdata[tid + 2];
	if (blockSize >= 2) sdata[tid] += sdata[tid + 1];
}

template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) {
	extern __shared__ int sdata[];
	unsigned int tid = threadIdx.x;
	unsigned int i = blockIdx.x*(blockSize*2) + tid;
	unsigned int gridSize = blockSize*2*gridDim.x;
	sdata[tid] = 0;
	// Note: g_idata[i+blockSize] is read without a bounds check.
	while (i < n) { sdata[tid] += g_idata[i] + g_idata[i+blockSize]; i += gridSize; }
	__syncthreads();
	if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
	if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
	if (blockSize >= 128) { if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); }
	// blockSize cannot be deduced from the arguments, so it is passed explicitly.
	if (tid < 32) warpReduce<blockSize>(sdata, tid);
	if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

Can anybody tell me where I must change this code to get the max value instead of the sum?
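
A hedged sketch of one way to make that change: every += becomes a max, and the initial value 0 becomes the identity for max (INT_MIN for int data). The names foldMax, warpReduceMax, reduceMax and reduce_max_int are hypothetical. The sketch also adds three things that are not in the kernel above: a bounds check on the second load, __syncwarp() between the warp-level steps (CUDA 9 or newer, needed for correctness on Volta and later), and a host wrapper that fixes the block size at 256.

```cuda
#include <climits>
#include <cuda_runtime.h>

// One warp-level step: read the partner value, then publish the running max.
__device__ void foldMax(volatile int* sdata, unsigned int tid,
                        unsigned int offset, int& v)
{
    int other = sdata[tid + offset];
    __syncwarp();              // all lanes have read before anyone writes
    v = max(v, other);         // was: +=
    sdata[tid] = v;
    __syncwarp();              // writes are visible before the next step
}

// Call from threads 0..31; sdata must hold at least 64 ints.
template <unsigned int blockSize>
__device__ void warpReduceMax(volatile int* sdata, unsigned int tid)
{
    int v = sdata[tid];
    if (blockSize >= 64) foldMax(sdata, tid, 32, v);
    if (blockSize >= 32) foldMax(sdata, tid, 16, v);
    if (blockSize >= 16) foldMax(sdata, tid, 8, v);
    if (blockSize >= 8)  foldMax(sdata, tid, 4, v);
    if (blockSize >= 4)  foldMax(sdata, tid, 2, v);
    if (blockSize >= 2)  foldMax(sdata, tid, 1, v);
}

template <unsigned int blockSize>
__global__ void reduceMax(const int* g_idata, int* g_odata, unsigned int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockSize * 2) + tid;
    unsigned int gridSize = blockSize * 2 * gridDim.x;

    int best = INT_MIN;                          // was: sdata[tid] = 0
    while (i < n) {
        best = max(best, g_idata[i]);
        if (i + blockSize < n)                   // added bounds check
            best = max(best, g_idata[i + blockSize]);
        i += gridSize;
    }
    sdata[tid] = best;
    __syncthreads();

    if (blockSize >= 512) { if (tid < 256) sdata[tid] = max(sdata[tid], sdata[tid + 256]); __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) sdata[tid] = max(sdata[tid], sdata[tid + 128]); __syncthreads(); }
    if (blockSize >= 128) { if (tid < 64)  sdata[tid] = max(sdata[tid], sdata[tid + 64]);  __syncthreads(); }
    if (tid < 32) warpReduceMax<blockSize>(sdata, tid);
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// Hypothetical host wrapper: pass 1 gives one partial max per block,
// pass 2 reduces those partials with a single block.
int reduce_max_int(const int* d_in, unsigned int n)
{
    const unsigned int threads = 256, blocks = 64;
    int* d_partial = nullptr;
    cudaMalloc(&d_partial, blocks * sizeof(int));
    reduceMax<threads><<<blocks, threads, threads * sizeof(int)>>>(d_in, d_partial, n);
    reduceMax<threads><<<1, threads, threads * sizeof(int)>>>(d_partial, d_partial, blocks);
    int result = INT_MIN;
    cudaMemcpy(&result, d_partial, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_partial);
    return result;
}
```

For float data the same pattern should apply with fmaxf in place of max and -INFINITY in place of INT_MIN.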

_constant · May 5, 2011, 11:08am

As you can see, the number 64 and const int threads = 64; appear in that code snippet. To make it work for larger sizes, you would have to increase the shared memory usage as well as modify the ‘threads’ parameter.

sanek_dampir · May 5, 2011, 1:13pm

_constant, I tried this, but then the function find_min_max_dynamic returns a wrong value = 0.00000.
When I run the standalone example for getting the max value, it works fine for all array sizes without changing const int threads = 64 (with 64 or with 1024, everything works fine), but when I include the example in my program, it does not work! And I cannot find the bug :(
