Hello,
I'm new to CUDA and am trying to write a simple program that uses shared memory.
The program sums the elements of an array. I used this algorithm to compute the sum:
for d := 0 to log2(n) - 1 do
    for k from 0 to n - 1 by 2^(d + 1) in parallel do
        x[k + 2^(d + 1) - 1] := x[k + 2^d - 1] + x[k + 2^(d + 1) - 1]
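For comparison, here is a minimal sequential sketch of the same up-sweep in host-side C++, assuming n is a power of two. The name `upsweep_sum` is made up for illustration and is not from the original program. The inner loop is the part that would run in parallel: each k touches a disjoint pair of elements.

```cpp
#include <cassert>
#include <vector>

// Sequential reference for the up-sweep: after the loops, x[n - 1] holds the total.
// Assumption: n = x.size() is a power of two.
float upsweep_sum(std::vector<float> x)
{
    const int n = static_cast<int>(x.size());
    assert(n > 0 && (n & (n - 1)) == 0);   // power-of-two check

    // stride plays the role of 2^(d + 1) in the pseudocode.
    for (int stride = 2; stride <= n; stride *= 2) {
        // Each k is independent of the others within one pass.
        for (int k = 0; k < n; k += stride) {
            x[k + stride - 1] += x[k + stride / 2 - 1];
        }
    }
    return x[n - 1];
}
```

Note that k advances by 2^(d + 1), so only n / 2^(d + 1) values of k are valid in pass d.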
I want to keep the program simple, so I have just 512 elements and need 256 threads, which fit in one block.
With global memory there is no problem and the test passes. With shared memory the result differs from the sequential CPU result, and the test fails.
I tried fewer elements and found that, with shared memory, the test passes for 8 elements or fewer and fails otherwise.
Note: the number of elements always has the form n = pow(2, i).
My code is:
// ===----------------- MP3 - Modify this function ---------------------===
//! @param g_idata input data in global memory
// result is expected in index 0 of g_idata
//! @param n input number of elements to scan from input data
// ===------------------------------------------------------------------===
__global__ void reduction(float *g_data, int n)
{
    int d, k;
    extern __shared__ float temp[];
    int thid = threadIdx.x;
    int ai, bi;
    // load array to shared memory
    temp[2*(thid+1)-2] = g_data[2*(thid+1)-2];
    temp[2*(thid+1)-1] = g_data[2*(thid+1)-1];
    __syncthreads();
    //for(d = 0; n >= (int)pow(2.0,(double)d+1); d++){
    for (d = 0; n >= (1<<(d+1)); d++) {
        k = thid*(1<<(d+1));
        temp[k+(1<<(d+1))-1] += temp[k+(1<<d)-1]; // in case of using shared memory
        //g_data[k+(1<<(d+1))-1] += g_data[k+(1<<d)-1]; // in case of using global memory
        __syncthreads();
    }
    if (thid == 0)
        g_data[0] = temp[n-1]; // in case of using shared memory
        //g_data[0] = g_data[n-1]; // in case of using global memory
}
And this is how I call the kernel:
num_elements=256;
...
dim3 dimGrid(1,1);
dim3 dimBlock(num_elements/2,1);
reduction<<<dimGrid,dimBlock>>>(d_data, num_elements);
...
Why does the shared-memory version stop working when there are more than 8 elements?
Could you please help me?
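Two things in the code above look like plausible causes, though neither is confirmed here. First, an extern shared array gets its size from the third launch parameter, and the launch shown passes none, so no dynamic shared memory is requested. Second, every thread runs every loop iteration, so for larger d most threads compute an index k + 2^(d + 1) - 1 that lies past n - 1. The sketch below is one way to address both, assuming n is a power of two, n >= 2, and a single block of n / 2 threads. `reduce_on_gpu` is a hypothetical host helper introduced only for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Up-sweep sum of n floats in shared memory; the result is written to g_data[0].
// Assumptions: n is a power of two, n >= 2, one block of n / 2 threads.
__global__ void reduction(float *g_data, int n)
{
    extern __shared__ float temp[];   // sized by the third launch parameter
    int thid = threadIdx.x;

    // Each thread loads two elements into shared memory.
    temp[2 * thid]     = g_data[2 * thid];
    temp[2 * thid + 1] = g_data[2 * thid + 1];
    __syncthreads();

    for (int d = 0; (1 << (d + 1)) <= n; d++) {
        int stride = 1 << (d + 1);
        int k = thid * stride;
        // Only threads whose target index is inside the array do work.
        if (k + stride - 1 < n) {
            temp[k + stride - 1] += temp[k + (stride >> 1) - 1];
        }
        // Kept outside the if so that every thread reaches the barrier.
        __syncthreads();
    }

    if (thid == 0) {
        g_data[0] = temp[n - 1];
    }
}

// Hypothetical host helper: copies n floats to the device, runs the kernel,
// and returns the sum read back from element 0.
float reduce_on_gpu(const float *h_data, int n)
{
    float *d_data = nullptr;
    float result = 0.0f;
    size_t bytes = n * sizeof(float);

    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // Third launch parameter: bytes of dynamic shared memory per block.
    reduction<<<1, n / 2, bytes>>>(d_data, n);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("launch failed: %s\n", cudaGetErrorString(err));
    }

    cudaMemcpy(&result, d_data, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return result;
}
```

With n = 512, a call such as reduce_on_gpu(h_data, 512) launches 256 threads and requests 512 * sizeof(float) = 2048 bytes of shared memory.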