Problem with shared memory

I'm new to CUDA and I'm trying to write a simple program that uses shared memory.
I tried to compute the sum of the elements of an array with CUDA. I used this algorithm to find the sum:

for d := 0 to log2(n) - 1 do
    for k from 0 to n - 1 by 2^(d+1) in parallel do
        x[k + 2^(d+1) - 1] := x[k + 2^d - 1] + x[k + 2^(d+1) - 1]

I want to keep the program simple, so I have just 512 elements; that needs 256 threads, which fit in one block.
When I use global memory there is no problem and the test passes, but when I try to use shared memory the result is not the same as the sequential algorithm on the CPU, and the test fails.
I tried it with fewer elements, and I see that with 8 or fewer elements the shared-memory version passes the test; with more it doesn't.
Note: the number of elements is always a power of two, n = pow(2, i).

my code is:

// ===----------------- MP3 - Modify this function ---------------------===
//! @param g_data input data in global memory
//  result is expected in index 0 of g_data
//! @param n input number of elements to scan from input data
// ===------------------------------------------------------------------===
__global__ void reduction(float *g_data, int n)
{
    int d, k;
    extern __shared__ float temp[];

    int thid = threadIdx.x;

    // load array into shared memory
    temp[2*(thid+1)-2] = g_data[2*(thid+1)-2];
    temp[2*(thid+1)-1] = g_data[2*(thid+1)-1];

    //for(d = 0; n >= (int)pow(2.0, (double)d+1); d++){
    for(d = 0; n >= (1 << (d+1)); d++){
        k = thid * (1 << (d+1));
        temp[k+(1<<(d+1))-1] += temp[k+(1<<d)-1];       // in case of using shared memory
        //g_data[k+(1<<(d+1))-1] += g_data[k+(1<<d)-1]; // in case of using global memory
    }

    if(thid == 0)
        g_data[0] = temp[n-1];     // in case of using shared memory
        //g_data[0] = g_data[n-1]; // in case of using global memory
}


and this is how I call the kernel:

dim3 dimGrid(1, 1);
dim3 dimBlock(num_elements/2, 1);

reduction<<<dimGrid, dimBlock>>>(d_data, num_elements);

Why doesn't the program work with shared memory when there are more than 8 elements?

Would you please help me?

The problem is that the number of iterations of the inner loop is not fixed (i.e. it depends on d), while your CUDA program assumes it is always equal to the number of threads.

Your algorithm still contains a degree of recursion. To get a good speed-up with CUDA, you may need to get rid of that first.

I quickly read your code. You are using extern shared memory, but you didn't allocate the size of your shared memory in the kernel call:

reduction<<<dimGrid, dimBlock>>>(d_data, num_elements);

The third launch parameter is the dynamic shared memory size in bytes:

reduction<<<dimGrid, dimBlock, shared_memory_size_in_Byte>>>(d_data, num_elements);

I changed my code to:

reduction<<<dimGrid, dimBlock, shared_memory_size_in_Byte, 0>>>(d_data, num_elements);

It works now.

Thanks for your help.

Oh, I am sorry about my first post; I didn't check the syntax.