Correlation on GPU

sdtougao · October 21, 2010, 8:17am

Hi everyone,

I’m new to CUDA and parallel programming.

I’m computing the correlation of two signals with CUDA. The problem is that when I accumulate the partial sums of each block inside the kernel, the GPU result is zero. The kernel code is as follows:

__global__ void
reduce0_kernel( float* g_i1, float* g_i2, float* g_odata, unsigned int n)
{
    // shared memory; the size is determined by the host application
    extern __shared__ float sdata[];

    // thread index within the block
    unsigned int tid = threadIdx.x;
    // global index of the element handled by this thread
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

    // multiplication
    sdata[tid] = (i < n) ? (g_i1[i]*g_i2[i]) : 0;
    __syncthreads();

    for(unsigned int s=1;s<blockDim.x;s*=2)
    {
        if(tid % (2*s) == 0)
            sdata[tid] += sdata[tid+s];
        __syncthreads();
    }

    if(tid==0)
        g_odata[blockIdx.x] = sdata[0];

    for(unsigned int k=1;k<blockIdx.x;k++)
        g_odata[0] += g_odata[i]; // accumulate the partial sum of each block
}

Can anyone tell me why? Thank you very much.

Best regards,

sdtougao · October 22, 2010, 1:28pm

Can anyone help me? Thanks a lot.

gshi · October 22, 2010, 2:49pm

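The accumulation of the per-block partial sums has to happen in another kernel or on the CPU. Blocks are not synchronized with each other inside a kernel, so a thread that reads g_odata cannot assume that the other blocks have already written their partial sums.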
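
A minimal sketch of the CPU option, assuming a block size that is a power of two. The kernel is the one from the question with the final accumulation loop removed; the names dot_partial_kernel and correlate_on_gpu are made up for this illustration, and error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each block writes one partial sum of g_i1[i] * g_i2[i] to g_odata[blockIdx.x].
// Assumption: blockDim.x is a power of two.
__global__ void dot_partial_kernel(const float* g_i1, const float* g_i2,
                                   float* g_odata, unsigned int n)
{
    extern __shared__ float sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? g_i1[i] * g_i2[i] : 0.0f;
    __syncthreads();

    // Same interleaved reduction as in the question.
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Thread 0 writes only its own block's slot; no cross-block accumulation here.
    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: returns the sum over i of h_a[i] * h_b[i].
float correlate_on_gpu(const float* h_a, const float* h_b, unsigned int n)
{
    if (n == 0) return 0.0f;

    const unsigned int threads = 256;  // power of two
    const unsigned int blocks = (n + threads - 1) / threads;

    float *d_a = nullptr, *d_b = nullptr, *d_partial = nullptr;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice);

    // Third launch parameter: dynamic shared memory size in bytes.
    dot_partial_kernel<<<blocks, threads, threads * sizeof(float)>>>(
        d_a, d_b, d_partial, n);

    // This copy runs after the kernel in the default stream, so every
    // block has finished writing its partial sum before the host reads it.
    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_partial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_partial);

    // Final accumulation on the CPU.
    float sum = 0.0f;
    for (float p : partial)
        sum += p;
    return sum;
}
```

The loop at the end of the original kernel also indexes g_odata with i instead of k and bounds k by blockIdx.x rather than gridDim.x, but fixing those alone would not be enough, because the other blocks' results may not be in g_odata yet when it runs.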

yyfn · October 23, 2010, 1:24am

Maybe you need to have a look at the reduction algorithm!
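
For reference, here is a rough sketch of a common variant of the reduction, using sequential addressing instead of the modulo test and a second, single-block kernel to add up the per-block results on the GPU. It assumes a power-of-two block size; block_reduce, multiply_reduce_kernel, final_sum_kernel and correlate_two_pass are illustrative names, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// Sequential-addressing reduction of sdata[0 .. blockDim.x) into sdata[0].
// Assumption: blockDim.x is a power of two. Must be called by all threads
// of the block, because it contains __syncthreads().
__device__ void block_reduce(float* sdata, unsigned int tid)
{
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
}

// Pass 1: one partial sum of g_i1[i] * g_i2[i] per block.
__global__ void multiply_reduce_kernel(const float* g_i1, const float* g_i2,
                                       float* g_odata, unsigned int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + tid;

    sdata[tid] = (i < n) ? g_i1[i] * g_i2[i] : 0.0f;
    __syncthreads();

    block_reduce(sdata, tid);
    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}

// Pass 2: launched with a single block; adds the m partial sums.
__global__ void final_sum_kernel(const float* g_partial, float* g_result,
                                 unsigned int m)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;

    // Each thread first accumulates a strided slice of the partial sums.
    float acc = 0.0f;
    for (unsigned int k = tid; k < m; k += blockDim.x)
        acc += g_partial[k];
    sdata[tid] = acc;
    __syncthreads();

    block_reduce(sdata, tid);
    if (tid == 0)
        g_result[0] = sdata[0];
}

// Hypothetical host helper: returns the sum over i of h_a[i] * h_b[i].
float correlate_two_pass(const float* h_a, const float* h_b, unsigned int n)
{
    if (n == 0) return 0.0f;

    const unsigned int threads = 256;  // power of two
    const unsigned int blocks = (n + threads - 1) / threads;
    const size_t smem = threads * sizeof(float);

    float *d_a = nullptr, *d_b = nullptr, *d_partial = nullptr, *d_result = nullptr;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice);

    // Kernels in the same stream run in order, so pass 2 starts only
    // after every block of pass 1 has written its partial sum.
    multiply_reduce_kernel<<<blocks, threads, smem>>>(d_a, d_b, d_partial, n);
    final_sum_kernel<<<1, threads, smem>>>(d_partial, d_result, blocks);

    float result = 0.0f;
    cudaMemcpy(&result, d_result, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_partial);
    cudaFree(d_result);
    return result;
}
```

The kernel boundary between the two passes is what provides the synchronization across blocks that a single kernel lacks.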
