Parallel Reduction

GiulioPU · July 7, 2010, 1:01pm

Hello to everyone,

I'm trying to implement parallel reduction following the SDK example:

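[codebox]__global__ void reduce3(float *g_idata, float *g_odata, int NN)
{
    // threadsPerBlock must be a compile-time constant equal to blockDim.x,
    // and a power of two so the halving loop below covers every element
    __shared__ float sdata[threadsPerBlock];

    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;  // thread index within the block
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;

    // guard both loads so the kernel never reads past the NN input elements
    float x = 0.0f;
    if (i < (unsigned int)NN) x += g_idata[i];
    if (i + blockDim.x < (unsigned int)NN) x += g_idata[i + blockDim.x];
    sdata[tid] = x;
    __syncthreads();

    // do reduction in shared mem
    for (unsigned int s = blockDim.x/2; s > 0; s >>= 1) {
        if (tid < s) {
            printf("tid = %u \n", tid);  // debug output
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}[/codebox]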

The code above reduces the elements within each block, but afterwards I need to reduce one more time: I have to sum the per-block results, one element for each block. Is that right?

Example:

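100 elements ----> threadsPerBlock = 10 ----> DimBlock = 5

[codebox]
…
// grid size (number of blocks) comes first, then threads per block
reduce3<<<DimBlock, threadsPerBlock>>>(vec1, vec2, 100);
// vec2 must be copied back to host memory before the loop below
sum = 0;
for (i = 0; i < DimBlock; i++) sum += vec2[i];
…
[/codebox]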

LSChien · July 8, 2010, 5:15am

You can use an atomic operation on global memory, or invoke another kernel to sum the remaining vector.

GiulioPU · July 8, 2010, 9:50am

Thank you for the reply.

Can someone tell me how to use atomics in this case? I don't want to invoke another kernel. Thanks a lot!
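
For illustration, here is a minimal sketch of the atomic approach suggested above. It is not taken from the SDK: the names `reduceAtomic`, `sumWithAtomics` and `THREADS_PER_BLOCK` are hypothetical, and the block size of 256 is an assumption. The halving loop only covers every element when the block size is a power of two, which is why the sketch does not use 10 threads per block. It also assumes a device of compute capability 2.0 or higher, since `atomicAdd` on `float` is not available on older hardware.

```cuda
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 256  // assumed block size; must be a power of two

// Same block-level reduction as reduce3, but thread 0 of each block adds the
// block's partial sum into one global accumulator instead of writing a vector.
__global__ void reduceAtomic(const float *g_idata, float *g_sum, unsigned int n)
{
    __shared__ float sdata[THREADS_PER_BLOCK];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

    // first level of reduction while loading, guarded against reading past n
    float x = 0.0f;
    if (i < n) x += g_idata[i];
    if (i + blockDim.x < n) x += g_idata[i + blockDim.x];
    sdata[tid] = x;
    __syncthreads();

    // tree reduction in shared memory
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // one atomic per block combines the partial sums in global memory
    if (tid == 0) atomicAdd(g_sum, sdata[0]);
}

// Hypothetical host helper: copies n floats to the device, launches the
// kernel once, and returns the total. Error checking is omitted for brevity.
float sumWithAtomics(const float *h_in, unsigned int n)
{
    if (n == 0) return 0.0f;

    float *d_in = NULL, *d_sum = NULL, h_sum = 0.0f;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_sum, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_sum, 0, sizeof(float));  // accumulator must start at zero

    // each block consumes 2 * THREADS_PER_BLOCK input elements
    unsigned int perBlock = 2 * THREADS_PER_BLOCK;
    unsigned int blocks = (n + perBlock - 1) / perBlock;
    reduceAtomic<<<blocks, THREADS_PER_BLOCK>>>(d_in, d_sum, n);

    // this copy waits for the kernel to finish before reading the result
    cudaMemcpy(&h_sum, d_sum, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_sum);
    return h_sum;
}
```

Calling `sumWithAtomics(h_in, 100)` on a host array of 100 floats returns their sum without a second kernel or a host-side loop. Two things to keep in mind: the accumulator has to be zeroed before every launch, and the order in which blocks reach the `atomicAdd` is not fixed, so the last bits of a floating-point result can differ from run to run.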
