Finding Minimum in array

Hey Everybody , im trying to find the minimum variable in an array using CUDA reduction algorithm , but for some reason

it doesn’t work.

the call for the function :

findMin<<<blocks,THREADS_PER_BLOCK,blocks>>> (foundPoints,foundPointOnDev,MAXX * MAXY);

in this case blocks = 512

the foundPoints is a MAXX * MAXY array defined on the DEVICE,

foundPointOnDev is the variable declared on the DEVICE to hold the smallest number

the function :

__global__ void findMin(float4 *inPoints,float4 *outPoint,int n)


	extern __shared__ float4 sdata[];

	// load shared mem

    unsigned int tid = threadIdx.x;

    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

	float4 infinitePoint;

	/*infinitePoint.x = 9999;	infinitePoint.y = 9999;	infinitePoint.z = 9999;*/

        infinitePoint.w = 9999;


	//load everything to the shared memory

    sdata[tid] = (i < n) ? inPoints[i] : infinitePoint;



	for(unsigned int s=blockDim.x/2; s>0; s>>=1) 


		if (tid < s) 


			if (sdata[tid].w < sdata[tid + s].w)

				sdata[tid] = sdata[tid + s];




	if (tid == 0) *outPoint = sdata[0];


the problem is that im not getting the result i need

There seems to be no mechanism to find the minimum between blocks. As I read the code, a random block gets the chance to return the minimum within its data as the final result.

djmj1000 thanks for comment! In the future I’ll test my code, before posting it on the forum :P


Here is a complete code package for finding min max… It reaches roughly 80+ % utilization,

Hope it is useful:

my_max_min.rar (26.8 KB)

Hey Jimmy , thanks for the coda , can you give a short explanation about the code?


It was a long time ago since I wrote it ( hope it compiles and runs well ).

It was essentially very fast reduction sum code that I realized could easily be ported to find min and max. It finds both in the same kernel calls. If you want some additional speedup in just finding one of the two you may remove the code for either min or max.

Each block does a huge number of reductions (for large problems) in one kernel. Finally there is a kernel which takes the value and previously stored index and reduces those.