Hi everyone,

I’m new to CUDA and parallel programming.

I’m computing the correlation of two signals with CUDA. The problem is that when I accumulate the partial sums of all blocks inside the kernel, the GPU result is zero. The kernel code is as follows:

    __global__ void
    reduce0_kernel(float* g_i1, float* g_i2, float* g_odata, unsigned int n)
    {
        // shared memory; the size is determined by the host application
        extern __shared__ float sdata[];

        // thread index within the block
        unsigned int tid = threadIdx.x;

        // global index of the element handled by this thread
        unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

        // multiplication
        sdata[tid] = (i < n) ? (g_i1[i]*g_i2[i]) : 0;
        __syncthreads();

        // tree reduction in shared memory
        for(unsigned int s=1; s<blockDim.x; s*=2)
        {
            if(tid % (2*s) == 0)
                sdata[tid] += sdata[tid+s];
            __syncthreads();
        }

        // write this block's partial sum
        if(tid==0)
            g_odata[blockIdx.x] = sdata[0];

        for(unsigned int k=1; k<blockIdx.x; k++)
            g_odata[0] += g_odata[i]; // accumulate the partial sum of each block
    }
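For comparison, the value the kernel is meant to produce is the sum of the element-wise products of the two signals. A minimal host-side reference, assuming that zero-lag sum is the intended correlation and using a hypothetical function name `correlate_cpu` (not part of the code above), could look like this:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (illustrative only, not part of the kernel code):
// zero-lag correlation computed as the sum of element-wise products.
float correlate_cpu(const std::vector<float>& a, const std::vector<float>& b)
{
    // Use only the overlapping range, mirroring the (i < n) guard in the kernel.
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();

    double sum = 0.0;  // accumulate in double to limit rounding error
    for (std::size_t i = 0; i < n; ++i)
        sum += static_cast<double>(a[i]) * b[i];

    return static_cast<float>(sum);
}
```

Comparing this value with `g_odata[0]` after copying it back from the device is one way to check the GPU result. Small differences are expected because the GPU sums the floats in a different order.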

Can anyone tell me why? Thank you very much.

Best regards,