Problem with shared memory

I'm new to CUDA and I'm trying to write a simple program that uses shared memory.
I tried to compute the sum of the elements of an array with CUDA. I used this algorithm to find the sum:

for d := 0 to log2(n) - 1 do
    for k from 0 to n - 1 by 2^(d+1) in parallel do
        x[k + 2^(d+1) - 1] := x[k + 2^d - 1] + x[k + 2^(d+1) - 1]

I want to keep the program simple, so I have just 512 elements; that needs 256 threads, which fit in one block.
When I use global memory there is no problem and the test passes, but when I try to use shared memory the result is not the same as the sequential algorithm on the CPU, and the test fails.
I tried it with fewer elements, and I see that with 8 or fewer elements the shared-memory version passes the test; with more it doesn't.
Note: the number of elements is always a power of two, n = pow(2, i).

my code is:

// ===----------------- MP3 - Modify this function ---------------------===
//! @param g_data input data in global memory
//  result is expected in index 0 of g_data
//! @param n input number of elements to scan from input data
// ===------------------------------------------------------------------===
__global__ void reduction(float *g_data, int n)
{
    int d, k;
    extern __shared__ float temp[];

    int thid = threadIdx.x;

    // load array into shared memory
    temp[2*(thid+1)-2] = g_data[2*(thid+1)-2];
    temp[2*(thid+1)-1] = g_data[2*(thid+1)-1];

    //for(d = 0; n >= (int)pow(2.0, (double)d+1); d++){
    for(d = 0; n >= (1 << (d+1)); d++){
        k = thid * (1 << (d+1));
        temp[k+(1<<(d+1))-1] += temp[k+(1<<d)-1];       // in case of using shared memory
        //g_data[k+(1<<(d+1))-1] += g_data[k+(1<<d)-1]; // in case of using global memory
    }

    if(thid == 0)
        g_data[0] = temp[n-1];     // in case of using shared memory
        //g_data[0] = g_data[n-1]; // in case of using global memory
}


and this is how I call the kernel:

dim3 dimGrid(1, 1);
dim3 dimBlock(num_elements/2, 1);

reduction<<<dimGrid, dimBlock>>>(d_data, num_elements);

Why doesn't the program work with shared memory when there are more than 8 elements?

Would you please help me?

The problem is that the number of iterations of the inner loop is not fixed (i.e. it depends on d), while your CUDA program assumes it is always equal to the number of threads.

Your algorithm still contains a degree of recursion. To get a good speed-up with CUDA, you may need to get rid of that first.

I quickly read your code. You are using extern shared memory, but you didn't allocate the size of your shared memory in the kernel call:

reduction<<<dimGrid, dimBlock>>>(d_data, num_elements);

The third launch parameter is the dynamic shared memory size in bytes:

reduction<<<dimGrid, dimBlock, shared_memory_size_in_Byte>>>(d_data, num_elements);

I changed my code to:

reduction<<<dimGrid, dimBlock, shared_memory_size_in_Byte, 0>>>(d_data, num_elements);

It works now.

Thanks for your help.

Oh, I am sorry about my first post; I didn't check the syntax.