Find the Maximum value among 16 threads

Hi there!!!

I have this piece of code that calculates the maximum value among 16 threads. The problem is that I’m launching the kernel inside a “for” loop, and I think that with a little tweaking of the kernel I would not need to launch it this way. I just need help doing it.

The code goes like this:

DEVICE CODE

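#include <cfloat>

__global__ void ReductionKernel(float* pVector, int N, float* bigger)
{
	unsigned int tid = threadIdx.x;
	unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

	// Assumes blockDim.x == 16 (a power of two).
	__shared__ float sh[16];

	// Load one value per thread; pad out-of-range slots with the lowest
	// float so they can never win the comparison (a pad of 0 would be
	// wrong for all-negative input).
	if (i < N)
		sh[tid] = pVector[i];
	else
		sh[tid] = -FLT_MAX;
	__syncthreads();

	// Tree reduction in shared memory: halve the active range each step.
	for (unsigned int s = blockDim.x / 2; s > 0; s /= 2) {
		if (tid < s && sh[tid + s] > sh[tid])
			sh[tid] = sh[tid + s];
		__syncthreads();
	}

	// Thread 0 now holds the maximum of the block.
	if (tid == 0)
		bigger[0] = sh[0];
}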

HOST CODE

and I’m using this to launch it:

for (int i = 0; i < numvertex; i++)
{
	// Load into vTempHost the values whose maximum the kernel will compute.
	radiusTemp(vTempHost, radiusHost, i);

	cudaMemcpy(vTempDevice, vTempHost, sizeTemp, cudaMemcpyHostToDevice);

	ReductionKernel<<<1, BLOCKSIZE>>>(vTempDevice, 16, biggestDevice);

	cudaMemcpy(vTempHost, vTempDevice, sizeTemp, cudaMemcpyDeviceToHost);
	cudaMemcpy(biggestHost, biggestDevice, sizeof(float), cudaMemcpyDeviceToHost);

	vectorRadiusTemp[i] = biggestHost[0];
}

If anyone knows how to change this so that I no longer need the “for” loop, please let me know. It is taking too long, and I think it can be optimized, but I can’t figure out how.

Is it feasible to have a grid size equal to ‘numvertex,’ and have each block take care of the body of the above for loop in parallel?
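One way that could look, as a minimal sketch rather than a definitive implementation: assume the inputs for all vertices are packed into one contiguous array of numvertex * 16 floats, row after row. The names RowMaxKernel, rowMaxima and ROW, and the packed layout itself, are assumptions for illustration and are not in the original code. Each block reduces its own 16-value row and writes one result, so a single launch and two copies stand in for the whole loop. Error checking is left out for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define ROW 16  // values per vertex (assumed); must be a power of two

// One block per vertex: block b reduces row b of `data` into maxima[b].
// Must be launched with exactly ROW threads per block.
__global__ void RowMaxKernel(const float* data, float* maxima)
{
    __shared__ float sh[ROW];
    unsigned int tid = threadIdx.x;

    // Each thread loads one value of this block's row.
    sh[tid] = data[blockIdx.x * ROW + tid];
    __syncthreads();

    // Tree reduction in shared memory.
    for (unsigned int s = ROW / 2; s > 0; s >>= 1) {
        if (tid < s)
            sh[tid] = fmaxf(sh[tid], sh[tid + s]);
        __syncthreads();
    }

    // Thread 0 writes this block's maximum to its own output slot.
    if (tid == 0)
        maxima[blockIdx.x] = sh[0];
}

// Hypothetical host wrapper: `host` holds numvertex * ROW floats.
std::vector<float> rowMaxima(const std::vector<float>& host, int numvertex)
{
    std::vector<float> result(numvertex);
    if (numvertex == 0) return result;

    float* dData = nullptr;
    float* dMax = nullptr;
    cudaMalloc(&dData, host.size() * sizeof(float));
    cudaMalloc(&dMax, numvertex * sizeof(float));
    cudaMemcpy(dData, host.data(), host.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    // Grid size equals numvertex: one launch replaces the host loop.
    RowMaxKernel<<<numvertex, ROW>>>(dData, dMax);

    cudaMemcpy(result.data(), dMax, numvertex * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dMax);
    return result;
}
```

With this layout, the host would fill the packed array once (for example by calling something like radiusTemp for every vertex up front) and read all the maxima back in one copy instead of one per iteration.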

Thank you for the fast reply.

What I’m thinking of is a block of 16 threads, with the grid size defined by dividing the size of the 2D array (numvertex * 16) by the block size. For example:

ThreadCount = numvertex * 16;

(ThreadCount + 15) / 16;   // grid size: number of 16-thread blocks, rounded up
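To make that arithmetic concrete, here is a small sketch (the helper name blocksFor is hypothetical, not something from the code above). With 16 threads per block and numvertex * 16 values, the round-up division gives back exactly numvertex blocks, so it agrees with the one-block-per-vertex idea. The rounding only matters when the thread count is not a multiple of the block size.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: number of blocks of `blockSize` threads needed to
// cover `threadCount` threads (integer ceiling division).
__host__ __device__ inline int blocksFor(int threadCount, int blockSize)
{
    return (threadCount + blockSize - 1) / blockSize;
}
```

For example, blocksFor(numvertex * 16, 16) is numvertex, while blocksFor(17, 16) is 2 because the one leftover thread still needs a block.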

I hope it helps you