Reduction on odd number of thread / block


I am trying to get the index of the minimum values on each block of my array of doubles.
I am aware of the reduction optimisations but I am not sure if I can use them as my number of thread in a block is not even.
The user indirectly inputs the number of threads in a block (but not the number of blocks) and operations are done idependently on each block. Then I have to get the index (and not the value) of the minimum in each block and retrieve it.

I tried several things but I can only use it on an even number of threads in a block:

extern __shared__ int z[];

__global__ void gpu_getMin4_2D( double *R, int *res, double thr ){
	int id = blockIdx.x * blockDim.x + threadIdx.x ;
	int	nTotalThreads = blockDim.x;
	int	halfPoint, n;
	double	tmp;

	z[threadIdx.x] = threadIdx.x;

	while( nTotalThreads > 1 )
		halfPoint = ( nTotalThreads >> 1 ); /* Dividy by two */

		/*First half of the block will be active */
		if( threadIdx.x < halfPoint )
			n = id + halfPoint;
			tmp = R[ n ];

Can you just round up the number of threads the user requests and have the extra threads not do anything until you get to the reduction? If not to a power of 2, then at least to the next multiple of 32?

Yes I can do that but lots of reduction optimisations are based on reaching data in other blocks and I can’t do that. Moreover if I work in a block and about 32 threads of my block is waiting, that would not be really optimized, is it ?

Someone has probable has done that, in an elegant way maybe.

You don’t need to use a strictly binary reduction scheme. For scanning small arrays I have used simple two stage reduction schemes before, where a small number of threads scans the array, each doing there own reduction on the elements visited. Following a __syncthreads() call, one thread then did the final reduction across the accumulated results from the first step.

As an extension to bigger arrays, you could use the first 2^n threads in your thread block do the linear scanning, producing 2^n partial results, and then switch to the binary scheme. You would choose n to be as large as possible. For example, if your thread block has 201 threads, the first 128 threads would do the initial scan, followed by a strict binary scheme.

Do you have some code that I can modify ?

Typically I have a block size of 500-1000 and I make 1 million of blocks. I call that reduction function a lot in my program so I would like to make it the faster I can.

I used something like this.

__global__ void myMinkernel(double *R, int *res, int pow2,double thr,int total){

	unsigned int tid = threadIdx.x;
	unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
	int s;
	double tmpmin;
	int tmpidmin;
	z[tid] = R[i];
	z2[tid] = tid;
	z3[tid] = thr*10.0;


	for (s=pow2/2; s!=x0; s>>=1) {
		if (tid < s) {
			if(z[tid + s] < z[tid]){
				z2[tid] = z2[tid + s];
				z[tid] = z[tid + s];
        //each z[0] is the min and each z2[0] is the index of the min in the block

	if(tid == 0){
		tmpmin = z[0];
		tmpidmin = z2[0];

				tmpmin = z[s];
				tmpidmin = s;
		if(tmpmin < thr)
			res[blockIdx.x] = tmpmin;
			res[blockIdx.x] = -1;

The last part of the kernel can be improved but I wanted something that works. Maybe someone has suggestions.