I am trying to get the index of the minimum value in each block of my array of doubles.
I am aware of the reduction optimisations, but I am not sure whether I can use them, as my number of threads in a block is not even.
The user indirectly inputs the number of threads in a block (but not the number of blocks), and operations are done independently on each block. I then have to get the index (and not the value) of the minimum in each block and retrieve it.
I have tried several things, but they only work with an even number of threads in a block:
extern __shared__ int z[];

__global__ void gpu_getMin4_2D( double *R, int *res, double thr ){
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    int nTotalThreads = blockDim.x;
    int halfPoint, n;
    double tmp;

    z[threadIdx.x] = threadIdx.x;

    while( nTotalThreads > 1 )
    {
        halfPoint = ( nTotalThreads >> 1 ); /* Divide by two */
        /* First half of the block's threads stay active */
        if( threadIdx.x < halfPoint )
        {
            n = id + halfPoint;
            tmp = R[ n ];
            ...
Can you just round up the number of threads the user requests and have the extra threads not do anything until you get to the reduction? If not to a power of 2, then at least to the next multiple of 32?
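Something along these lines, maybe (just a hypothetical sketch, not your code: the padding to a power of two is done on the host, n is the user-requested block size, and the data is assumed to be laid out as n elements per block):

#include <float.h>

extern __shared__ double sVal[];                 /* dynamic shared memory: values first ... */

__global__ void minIdxPadded(const double *R, int *res, int n)
{
    int *sIdx = (int *)&sVal[blockDim.x];        /* ... then the matching indices */

    int tid     = threadIdx.x;
    size_t base = (size_t)blockIdx.x * n;        /* each block owns n elements of R */

    /* padded threads (tid >= n) load a sentinel so they can never win the min */
    sVal[tid] = (tid < n) ? R[base + tid] : DBL_MAX;
    sIdx[tid] = tid;
    __syncthreads();

    /* blockDim.x is a power of two here, so the classic halving reduction works */
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s && sVal[tid + s] < sVal[tid]) {
            sVal[tid] = sVal[tid + s];
            sIdx[tid] = sIdx[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0)
        res[blockIdx.x] = sIdx[0];               /* index of the block minimum */
}

You would launch it with something like minIdxPadded<<<nBlocks, pow2, pow2*(sizeof(double)+sizeof(int))>>>(R, res, n), where pow2 is n rounded up to the next power of two.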
Yes, I can do that, but many reduction optimisations are based on accessing data in other blocks, and I can't do that. Moreover, if I work within a block and about 32 threads of my block are just waiting, that would not be very efficient, would it?
Someone has probably already done that, maybe in an elegant way.
You don't need to use a strictly binary reduction scheme. For scanning small arrays I have used simple two-stage reduction schemes before, where a small number of threads scans the array, each doing their own reduction on the elements visited. Following a __syncthreads() call, one thread then did the final reduction across the accumulated results from the first step.
As an extension to bigger arrays, you could have the first 2^n threads in your thread block do the linear scanning, producing 2^n partial results, and then switch to the binary scheme. You would choose n to be as large as possible. For example, if your thread block has 201 threads, the first 128 threads would do the initial scan, followed by a strict binary scheme.
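Untested sketch of what I mean (identifiers are made up; pow2 would be the largest power of two not exceeding the block size, passed in from the host):

extern __shared__ double sVal[];                 /* pow2 values, followed by pow2 indices */

__global__ void minIdxTwoStage(const double *R, int *res, int pow2)
{
    int *sIdx = (int *)&sVal[pow2];

    int tid     = threadIdx.x;
    size_t base = (size_t)blockIdx.x * blockDim.x;

    /* stage 1: the first pow2 threads scan the block with stride pow2,
       each keeping its own running minimum and the index where it was found */
    if (tid < pow2) {
        double best = R[base + tid];
        int bestIdx = tid;
        for (int j = tid + pow2; j < blockDim.x; j += pow2) {
            if (R[base + j] < best) { best = R[base + j]; bestIdx = j; }
        }
        sVal[tid] = best;
        sIdx[tid] = bestIdx;
    }
    __syncthreads();

    /* stage 2: strict binary reduction over the pow2 partial results */
    for (int s = pow2 / 2; s > 0; s >>= 1) {
        if (tid < s && sVal[tid + s] < sVal[tid]) {
            sVal[tid] = sVal[tid + s];
            sIdx[tid] = sIdx[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0)
        res[blockIdx.x] = sIdx[0];               /* index of the min within this block */
}

The dynamic shared allocation would then need pow2*(sizeof(double)+sizeof(int)) bytes per block.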
Typically I have a block size of 500-1000 and I launch 1 million blocks. I call that reduction function a lot in my program, so I would like to make it as fast as I can.
/* z (double), z2 (int) and z3 (double) are shared arrays declared elsewhere */
__global__ void myMinkernel(double *R, int *res, int pow2, double thr, int total){
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    int s;
    double tmpmin;
    int tmpidmin;

    /* z holds the values, z2 the matching indices within the block
       (z3 is written here but not used in the code shown) */
    z[tid] = R[i];
    z2[tid] = tid;
    z3[tid] = thr*10.0;
    __syncthreads();

    /* binary reduction over the first pow2 elements
       (pow2 being the largest power of two <= blockDim.x) */
    for (s = pow2/2; s != 0; s >>= 1) {
        if (tid < s) {
            if (z[tid + s] < z[tid]) {
                z2[tid] = z2[tid + s];
                z[tid] = z[tid + s];
            }
        }
        __syncthreads();
    }
    /* z[0] is now the min of the first pow2 elements and z2[0] its index;
       thread 0 scans the remaining elements (pow2..total-1) serially */
    if (tid == 0) {
        tmpmin = z[0];
        tmpidmin = z2[0];
        for (s = pow2; s < total; s++) {
            if (z[s] < tmpmin) {
                tmpmin = z[s];
                tmpidmin = s;
            }
        }
        if (tmpmin < thr)
            res[blockIdx.x] = tmpidmin;   /* store the index of the min, not its value (res is int*) */
        else
            res[blockIdx.x] = -1;
    }
}
The last part of the kernel can be improved, but I wanted something that works. Maybe someone has suggestions.
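One possible tweak, as an untested sketch that keeps your parameter names (here z2 is carved out of the same dynamic shared allocation, and pow2 is assumed to be the largest power of two <= blockDim.x, as in your kernel): instead of having thread 0 scan the tail serially at the end, fold the tail elements into the first pow2 slots right after loading, so the binary loop already covers all total elements:

extern __shared__ double z[];                    /* pow2 doubles, then pow2 ints for the indices */

__global__ void myMinkernel2(double *R, int *res, int pow2, double thr, int total)
{
    int *z2  = (int *)&z[pow2];
    int tid  = threadIdx.x;
    double v = R[(size_t)blockIdx.x * blockDim.x + tid];
    int s;

    if (tid < pow2) {
        z[tid]  = v;
        z2[tid] = tid;
    }
    __syncthreads();

    /* fold the tail (threads pow2..total-1) into the first total-pow2 slots,
       one tail thread per slot, so no serial scan is needed afterwards */
    if (tid >= pow2 && tid < total && v < z[tid - pow2]) {
        z[tid - pow2]  = v;
        z2[tid - pow2] = tid;
    }
    __syncthreads();

    /* the same binary reduction as before, over the first pow2 elements */
    for (s = pow2/2; s > 0; s >>= 1) {
        if (tid < s && z[tid + s] < z[tid]) {
            z[tid]  = z[tid + s];
            z2[tid] = z2[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0)
        res[blockIdx.x] = (z[0] < thr) ? z2[0] : -1;
}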