I have this code for finding the minimum value of an array (it is small). I run it in a single block:
nTotalThreads = blockDim.x;
while (nTotalThreads > 1)
{
    halfPoint = (nTotalThreads >> 1);   // divide by two
    // only the first half of the threads will be active.
    if (nTotalThreads % 2 == 0)
    {
        if (idx < halfPoint)
        {
            // Get the shared value stored by another thread
            temp = time_array[idx + halfPoint].t;
            if (temp < time_array[idx].t)
            {
                time_array[idx].t   = temp;
                time_array[idx].pos = time_array[idx + halfPoint].pos;
            }
        }
    }
    else
    {
        if (idx <= halfPoint)
        {
            // Get the shared value stored by another thread
            temp = time_array[idx + halfPoint].t;
            if (temp < time_array[idx].t)
            {
                time_array[idx].t   = temp;
                time_array[idx].pos = time_array[idx + halfPoint].pos;
            }
        }
    }
    __syncthreads();

    if (nTotalThreads % 2 == 0)
        nTotalThreads = (nTotalThreads >> 1);       // divide by two
    else
        nTotalThreads = (nTotalThreads >> 1) + 1;
}
BUT when I run the code below (in a serial program) it is faster (I think we expect this, because the size of the array is small):
for (i = 0; i < N; i++)
{
    if (mint > akis[i])
        mint = akis[i];
}
What do I have to do in order to accelerate the parallel version?
Use a parallel reduction. There is example code and a paper in the CUDA SDK. There are also convenient pre-written parallel reduction routines in the thrust template library.
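For a minimum over an array, thrust's min_element does the whole reduction on the device; a minimal host-side sketch with made-up data (the array contents and size here are illustrative, not taken from the original program):

#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/extrema.h>

int main()
{
    // Illustrative data; in the real code this would be the per-reaction times.
    float h_times[] = { 3.2f, 0.7f, 5.1f, 0.4f, 2.9f };
    thrust::device_vector<float> d_times(h_times, h_times + 5);

    // min_element returns an iterator to the smallest element, so both the
    // value and its original position are recovered in one call.
    thrust::device_vector<float>::iterator it =
        thrust::min_element(d_times.begin(), d_times.end());

    int   pos  = it - d_times.begin();  // index of the minimum
    float mint = *it;                   // the minimum value itself

    printf("min = %f at position %d\n", mint, pos);
    return 0;
}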
Reductions work with any associative operator. Of course, if the operator as implemented in floating point isn't exactly associative (as with floating-point addition), results can vary slightly depending on the order of operations. However, that isn't the case for min.
It is called a reduction and has been in every SDK release for roughly the last three years. There is also a version here, and the thrust templates I mentioned are discussed here.
A reduction is a general data parallel operation which can be applied using any associative operator.
Even so, a parallel reduction will perform better than the bisection(?) approach you are using now. The key to performance is memory throughput.
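A minimal sketch of that kind of throughput-oriented min reduction, assuming the input is a plain float array in global memory and the block size is a power of two (the BLOCKSIZE constant and the kernel name are illustrative, not from the original code):

#include <cfloat>

#define BLOCKSIZE 256   // illustrative; any power of two up to the hardware limit

// Launch as: block_min<<<1, BLOCKSIZE>>>(d_in, d_out, n);
__global__ void block_min(const float *in, float *out, int n)
{
    __shared__ float smin[BLOCKSIZE];

    // Each thread streams through the input with a block-sized stride,
    // keeping a running minimum in a register. This strided pass is
    // where the memory throughput comes from.
    float m = FLT_MAX;
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        m = fminf(m, in[i]);

    smin[threadIdx.x] = m;
    __syncthreads();

    // Classic tree reduction in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            smin[threadIdx.x] = fminf(smin[threadIdx.x], smin[threadIdx.x + s]);
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = smin[0];
}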
Because my program is an algorithm for biology (the Gillespie FRM method), I take the value in the time array from time_array[idx] = (1/a)*logf(1/r), where r is a random number from curand, so it is unlikely that two positions in the array will have the same value. But if I do have two equal minimum values, I don't care which one I take; I only need to know whether it was in position 2 or 100.
Every thread is a reaction, and I must know which reaction will run each time; that is why I have the time_array of structs of type data_t.
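Presumably data_t pairs the time with its index; a minimal sketch based on the .t and .pos accesses in the code above (the exact field types are an assumption):

// Assumed layout: a candidate time and the reaction index it belongs to.
struct data_t {
    float t;    // e.g. (1/a)*logf(1/r)
    int   pos;  // original position / reaction index
};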
So the problem is really to find a blockwise minimum and record the position of the minimum of a data set where each data point is provided by a single thread? So the maximum number of points in the data set to search is the number of threads in a block?
Blockwise minimum just means the minimum of the value set over the block (as opposed to at some other scope like a global minimum).
If you already have the block data in a shared memory array, then the "tree"-like shared memory reduction technique that has now been suggested three times is the most efficient way to reduce it to a blockwise minimum. Something like this:
__shared__ volatile data_t data[BLOCKSIZE];

// whatever your code does

__syncthreads();

// Here mindata(a,b) sets a to the minimum of a and b
if (threadIdx.x < 32) {
    for (int i = 32 + threadIdx.x; i < BLOCKSIZE; i += 32) {
        mindata(data[threadIdx.x], data[i]);
    }
    if (threadIdx.x < 16) mindata(data[threadIdx.x], data[threadIdx.x + 16]);
    if (threadIdx.x <  8) mindata(data[threadIdx.x], data[threadIdx.x +  8]);
    if (threadIdx.x <  4) mindata(data[threadIdx.x], data[threadIdx.x +  4]);
    if (threadIdx.x <  2) mindata(data[threadIdx.x], data[threadIdx.x +  2]);
    if (threadIdx.x == 0) mindata(data[0], data[1]);
}
// Blockwise minimum now held in data[0]
The volatile declaration is important if you are using a Fermi card, because there can be problems with compiler optimisation breaking the implicit warp level synchronization which the reduction relies on.
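One way such a mindata helper could look, as a sketch assuming the float t / int pos fields of data_t described earlier:

// Sketch: keep in a the smaller time of the pair (a, b), together with
// the position it came from. The volatile qualifiers match the shared
// array declaration above.
__device__ void mindata(volatile data_t &a, volatile data_t &b)
{
    if (b.t < a.t) {
        a.t   = b.t;
        a.pos = b.pos;
    }
}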
Also, is mindata a command in cuda.h? In this code I find the minimum, but which variable holds the position in the original array? Does this code work if the block has more than 100 threads?
17 threads per block is incredibly wasteful: nearly half of every active warp's execution cycles (15 of the 32 lanes) go unused, and you will not be able to cover any of the fixed pipeline latency in the architecture. In real terms, you will be wasting about 90% of the GPU's capacity.