Hello everyone,
I'm trying to implement parallel reduction following the SDK example:
[codebox]__global__ void reduce3(float *g_idata, float *g_odata, int NN)
{
    // threadsPerBlock must be a compile-time constant equal to blockDim.x
    __shared__ float sdata[threadsPerBlock];
    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x; // THREAD INDEX
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
    // guard both loads so threads past the end of the input (NN elements) add 0
    float v = 0.0f;
    if (i < NN) v += g_idata[i];
    if (i + blockDim.x < NN) v += g_idata[i+blockDim.x];
    sdata[tid] = v;
    __syncthreads();
    // do reduction in shared mem
    for(unsigned int s=blockDim.x/2; s>0; s>>=1) {
        if (tid < s) {
            printf("tid = %u \n",tid); // debug output
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}[/codebox]
The above code reduces the elements within each block, but afterwards I need to reduce one more time: I have to sum the per-block results (number_of_blocks elements). Is that right?
Example:
100 elements ----> threadPerBlock = 10 -----> DimBlock = 5
[codebox]
…
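// launch configuration is <<<number of blocks, threads per block>>>
reduce3<<<DimBlock, threadPerBlock>>>(vec1, vec2, NN); // NN = number of input elements (100 here)
// if vec2 is a device pointer, copy its DimBlock partial sums to the host before this loop
sum = 0;
for (i = 0; i < DimBlock; i++) sum += vec2[i];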
…
[/codebox]
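To make the two-stage idea above concrete, here is a minimal sketch of the whole flow: one kernel pass that produces one partial sum per block, followed by a small host loop over those partial sums. The names `reducePartial`, `reduceOnGpu` and `kThreadsPerBlock` are hypothetical and not from the SDK. I assume 16 threads per block instead of 10, because the halving loop only covers every element when the block size is a power of two (with 10 threads, `sdata[4]` would never be added). Error handling is reduced to a single assert.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Assumption: compile-time block size, must be a power of two.
constexpr int kThreadsPerBlock = 16;
static_assert((kThreadsPerBlock & (kThreadsPerBlock - 1)) == 0,
              "block size must be a power of two");

// Stage 1: each block sums up to 2 * blockDim.x inputs into g_odata[blockIdx.x].
__global__ void reducePartial(const float *g_idata, float *g_odata, unsigned int n)
{
    __shared__ float sdata[kThreadsPerBlock];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

    // Out-of-range loads contribute 0, so n need not fill the grid exactly.
    float v = 0.0f;
    if (i < n) v += g_idata[i];
    if (i + blockDim.x < n) v += g_idata[i + blockDim.x];
    sdata[tid] = v;
    __syncthreads();

    // Tree reduction in shared memory.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: returns the sum of h_in[0..n).
float reduceOnGpu(const float *h_in, int n)
{
    if (n <= 0) return 0.0f;
    const int perBlock = 2 * kThreadsPerBlock;            // inputs handled per block
    const int numBlocks = (n + perBlock - 1) / perBlock;  // round up

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, numBlocks * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    // Launch configuration is <<<number of blocks, threads per block>>>.
    reducePartial<<<numBlocks, kThreadsPerBlock>>>(d_in, d_out, (unsigned int)n);

    // Copy the per-block partial sums back; this copy also waits for the kernel.
    std::vector<float> partial(numBlocks);
    cudaError_t err = cudaMemcpy(partial.data(), d_out, numBlocks * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    (void)err;
    cudaFree(d_in);
    cudaFree(d_out);

    // Stage 2: sum the numBlocks partial results on the host.
    float sum = 0.0f;
    for (float p : partial) sum += p;
    return sum;
}
```

For 100 inputs this launches 4 blocks of 16 threads, so the host loop only adds 4 values. For a large number of blocks, the second stage could instead be another launch of the same kernel over the partial sums.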