shared memory problems

Hi,

I would like to know if there is something wrong with the following program.
I’m trying to do a simple scan, but I can’t get the correct result.

[codebox]
__global__ void reduceKernel1(float* _iarray, float* _oarray, int N)
{
	//shared memory array
	extern __shared__ float sdata[];
	int tid = threadIdx.x;
	int i = blockIdx.x * blockDim.x + threadIdx.x;

	//each thread loads data to shared memory
	if(i < blockDim.x){
		sdata[tid] = _iarray[i];
	}
	//synchronization
	__syncthreads();

	//loop over sdata
	for(int j = 1; j < blockDim.x; j *= 2)
	{
		if(tid % (2*j) == 0)
		{
			sdata[tid] += sdata[tid + j] ;
			__syncthreads();
		}
		__syncthreads();
	}

	//write back the data to the output array
	if(tid < ARRAYSIZE )
		_oarray[i] = sdata[tid];
	__syncthreads();

}[/codebox]

Thank you for your help

In my experience, how the shared memory size is set when launching the kernel is a common source of errors.

Sir syoon,

Thank you for your reply.

Actually I’m trying not to set the shared memory size. This is why it is declared with the extern keyword.

I don’t know if I misunderstand some principles of CUDA.

[codebox]
	if(tid % (2*j) == 0)
	{
		sdata[tid] += sdata[tid + j] ;
		__syncthreads();
	}
	__syncthreads();[/codebox]

Having a __syncthreads() inside an “if” block whose condition doesn’t evaluate the same for all threads in the block can hang the kernel or have other unexpected side effects. The barrier after the “if” block is enough.
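
To illustrate the point about barriers, here is a minimal sketch of the same interleaved reduction with every __syncthreads() moved out of the branch, so that all threads in the block reach each barrier. The bounds checks against n, the zero padding, the one-partial-sum-per-block output, and the host helper sumWithBlocks are assumptions added for illustration; they are not part of the original post.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Interleaved-addressing block reduction; each block writes one partial sum.
__global__ void reduceKernelFixed(const float* in, float* out, int n)
{
    extern __shared__ float sdata[];   // byte size is supplied at launch time
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Pad with zeros so out-of-range threads do not affect the sum.
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    for (int j = 1; j < blockDim.x; j *= 2) {
        // Only some threads add; the second test keeps tid + j in range.
        if (tid % (2 * j) == 0 && tid + j < blockDim.x) {
            sdata[tid] += sdata[tid + j];
        }
        // The barrier sits outside the branch, so every thread executes it.
        __syncthreads();
    }

    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: reduces `data` on the GPU, then adds the
// per-block partial sums on the CPU. Error checking is omitted for brevity.
float sumWithBlocks(const std::vector<float>& data, int threads)
{
    assert(!data.empty() && threads > 0);
    int n = static_cast<int>(data.size());
    int blocks = (n + threads - 1) / threads;

    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, blocks * sizeof(float));
    cudaMemcpy(dIn, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Third launch parameter: dynamic shared memory size in bytes.
    reduceKernelFixed<<<blocks, threads, threads * sizeof(float)>>>(dIn, dOut, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dOut, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

Calling sumWithBlocks(values, 256) on a host vector should return the sum of its elements, up to floating-point rounding.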

I think you already know more than I do…Just in case…I had some problems with shared memory before…

The following is from ‘tera’ on my question on shared memory usage…

==============

the size must be known at compile time.

What you can do is use the dynamic shared memory allocation feature. Declare a_d as follows:

extern __shared__ float a_d[];

and add the required size as third configuration parameter of the kernel invocation:

my_kernel<<<gridsize, blocksize, blocksize.x*blocksize.y*blocksize.z*sizeof(float)>>>();

Note that this only works for one variable size array.

==============

If you are sure you did this part right, please disregard my reply.

With tera’s help I am now doing fine using dynamic shared memory…
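
As a small illustration of the quoted advice, the sketch below passes the dynamic shared memory size as the third launch parameter and works around the one-array limit by carving a single extern buffer into two logical arrays. The kernel name reverseBlock, the helper reverseOnGpu, and the layout (floats first, then ints) are hypothetical choices made for this example, and it assumes the input fits in a single block.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Reverses the elements of one block and records where each one came from.
__global__ void reverseBlock(const float* in, float* out, int* srcIndex)
{
    // One dynamically sized buffer; its byte size is the third launch parameter.
    extern __shared__ float shared[];
    float* vals = shared;                                    // first blockDim.x floats
    int* ids = reinterpret_cast<int*>(shared + blockDim.x);  // next blockDim.x ints

    int tid = threadIdx.x;
    vals[tid] = in[tid];
    ids[tid] = tid;
    __syncthreads();   // unconditional: every thread reaches this barrier

    int r = blockDim.x - 1 - tid;
    out[tid] = vals[r];
    srcIndex[tid] = ids[r];
}

// Hypothetical host helper: one block, one thread per element.
// Assumes data.size() does not exceed the device's maximum threads per block.
std::vector<float> reverseOnGpu(const std::vector<float>& data, std::vector<int>& srcIndex)
{
    assert(!data.empty());
    int n = static_cast<int>(data.size());

    float* dIn = nullptr;
    float* dOut = nullptr;
    int* dIdx = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMalloc(&dIdx, n * sizeof(int));
    cudaMemcpy(dIn, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Room for both logical arrays inside the single extern buffer.
    size_t sharedBytes = n * (sizeof(float) + sizeof(int));
    reverseBlock<<<1, n, sharedBytes>>>(dIn, dOut, dIdx);

    std::vector<float> result(n);
    srcIndex.resize(n);
    cudaMemcpy(result.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(srcIndex.data(), dIdx, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    cudaFree(dIdx);
    return result;
}
```

If the third launch parameter is left out, the extern array has zero bytes behind it and the kernel reads and writes out of bounds, which matches the symptom described in this thread.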

How do you invoke the kernel - can you post that line of code as well?

Hi sirs,

Thanks for your help.
You are completely right. I declared the shared memory as extern but
forgot to pass the required size in the kernel call.

before: reduceKernel1 <<< blocks, threads >>> ();
after: reduceKernel1 <<< blocks, threads ,ARRAYSIZE * sizeof(float)>>> ();
Now everything works normally.

Thank you!
