I'm trying to implement parallel reduction following the SDK example:

__global__ void reduce3(float *g_idata, float *g_odata, int NN)
{
    __shared__ float sdata[threadsPerBlock];

    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;  // thread index within the block
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
    __syncthreads();  // all loads must finish before the reduction starts

    // do reduction in shared mem (the halving loop assumes blockDim.x is a power of two)
    for(unsigned int s=blockDim.x/2; s>0; s>>=1) {
        if (tid < s) {
            printf("tid = %u \n", tid);
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();  // wait for every thread before the next step
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}


The above code reduces the elements within each block, but afterwards I have to reduce one more time: I have to sum the number_of_blocks partial results. Is that right?


100 elements ----> threadPerBlock = 10 ----> DimBlock = 5


reduce3<<<DimBlock, threadPerBlock>>>(vec1, vec2, 100);


for (i=0;i<DimBlock;i++) sum += vec2[i];
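
The two-step flow described here (launch the kernel, then add up one partial sum per block on the CPU) could look roughly like the sketch below. It is only a sketch under a few assumptions: threadsPerBlock is set to 128 because the halving loop in the kernel only works for power-of-two block sizes (with 10 threads per block some elements would be skipped), the loads are guarded with NN so the input length need not be a multiple of 2*threadsPerBlock, and the helper name sumOnHost is made up for illustration. Error checking of the CUDA calls is left out for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int threadsPerBlock = 128;  // assumed value; must be a power of two

// Per-block reduction as in the post, with guarded loads and barriers.
__global__ void reduce3(const float *g_idata, float *g_odata, int NN)
{
    __shared__ float sdata[threadsPerBlock];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
    unsigned int n = (unsigned int)NN;

    // Each thread loads up to two elements; out-of-range loads count as 0.
    float v = 0.0f;
    if (i < n) v += g_idata[i];
    if (i + blockDim.x < n) v += g_idata[i + blockDim.x];
    sdata[tid] = v;
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // One partial sum per block.
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// Hypothetical helper: runs the kernel, copies the partial sums back
// to the host and adds them up on the CPU.
float sumOnHost(const std::vector<float> &h_in)
{
    int NN = (int)h_in.size();
    if (NN == 0) return 0.0f;
    int numBlocks = (NN + 2 * threadsPerBlock - 1) / (2 * threadsPerBlock);

    float *d_in = nullptr, *d_partial = nullptr;
    cudaMalloc(&d_in, NN * sizeof(float));
    cudaMalloc(&d_partial, numBlocks * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), NN * sizeof(float), cudaMemcpyHostToDevice);

    // Launch configuration is <<<number of blocks, threads per block>>>.
    reduce3<<<numBlocks, threadsPerBlock>>>(d_in, d_partial, NN);

    // The blocking copy also waits for the kernel to finish.
    std::vector<float> h_partial(numBlocks);
    cudaMemcpy(h_partial.data(), d_partial, numBlocks * sizeof(float),
               cudaMemcpyDeviceToHost);

    float sum = 0.0f;
    for (int b = 0; b < numBlocks; b++) sum += h_partial[b];

    cudaFree(d_in);
    cudaFree(d_partial);
    return sum;
}
```

Note that the partial sums live in device memory, so they have to be copied back with cudaMemcpy before the host loop can read them. For a small number of blocks this final loop on the CPU is usually cheap.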


You can use an atomic operation on global memory, or invoke another kernel to sum the remaining vector of partial results.

Thank you for the reply.

Can someone tell me how to use an atomic in this case? I don't want to invoke another kernel… Thanks a lot!
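
One possible way to do it with an atomic, as a minimal sketch rather than a definitive implementation: keep the per-block reduction in shared memory, and let thread 0 of each block add its partial sum to a single float in global memory with atomicAdd. The float overload of atomicAdd needs a device of compute capability 2.0 or higher. The names reduce3_atomic and sumWithAtomic are made up for illustration, threadsPerBlock = 128 is an assumed power-of-two value (the halving loop needs a power of two), and error checking is omitted.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int threadsPerBlock = 128;  // assumed value; must be a power of two

// Hypothetical variant of reduce3: block-level reduction in shared memory,
// then a single atomicAdd per block into one global accumulator.
__global__ void reduce3_atomic(const float *g_idata, float *g_result, int NN)
{
    __shared__ float sdata[threadsPerBlock];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
    unsigned int n = (unsigned int)NN;

    // Each thread loads up to two elements; out-of-range loads count as 0.
    float v = 0.0f;
    if (i < n) v += g_idata[i];
    if (i + blockDim.x < n) v += g_idata[i + blockDim.x];
    sdata[tid] = v;
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Only one atomic per block, so contention stays low.
    if (tid == 0) atomicAdd(g_result, sdata[0]);
}

// Hypothetical host helper: the accumulator must be zeroed before the launch.
float sumWithAtomic(const std::vector<float> &h_in)
{
    int NN = (int)h_in.size();
    if (NN == 0) return 0.0f;
    int numBlocks = (NN + 2 * threadsPerBlock - 1) / (2 * threadsPerBlock);

    float *d_in = nullptr, *d_result = nullptr;
    cudaMalloc(&d_in, NN * sizeof(float));
    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), NN * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_result, 0, sizeof(float));  // all-zero bytes represent 0.0f

    reduce3_atomic<<<numBlocks, threadsPerBlock>>>(d_in, d_result, NN);

    // The blocking copy also waits for the kernel to finish.
    float sum = 0.0f;
    cudaMemcpy(&sum, d_result, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_result);
    return sum;
}
```

Two things to keep in mind with this approach: the accumulator has to be reset to zero before every launch, and because the blocks finish in no fixed order, the floating-point additions can happen in a different order from run to run, so the last digits of the result may vary slightly.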