Reduction operation returns incorrect result

sreeram · November 18, 2018, 4:40pm

Hi,

The sum of an array (with 1681 elements) is computed using reduction approach. This approach works fine for arrays with number of elements less than 1024. But when, the number of elements is >1024 then it gives wrong results. The code snippet is provided below and the sample array is attached herewith. Can anyone suggest how to resolve this problem?

//const int N1 = 25;
const int N1 = 1681;
const int BLOCKS = 1;
const int halfblock = 512;

void global Compute_Sum(double * const out1,
double const * const in){
shared double sdata[N1];
sdata[threadIdx.x] = in[threadIdx.x + blockIdx.x*N1];
__syncthreads();
for (int i = halfblock; i>0; i>>=1){
if ((threadIdx.x < i) && (threadIdx.x+i < N1))
sdata[threadIdx.x] += sdata[threadIdx.x+i];
__syncthreads();}
if (!threadIdx.x){
out1[blockIdx.x] = sdata[0];
}
}

int main(int argc,char **argv) {

//FILE *f1=fopen("data_25.txt","r");
FILE *f1=fopen("data_1681.txt","r");
	double *h_data_in = (double*) malloc(N1*sizeof(double));
double *h_sum = ((double*) malloc(BLOCKS*sizeof(double)));

if(f1==NULL) return 1;
for(int i = 0; i < N1; ++i) {fscanf(f1, "%lf",&h_data_in[i]);}
    h_sum[0] = 0.0f;
for (int jj = 0; jj < N1; jj++){printf("\n h_data_in[%d] = %0.15f \n",jj,h_data_in[jj]);}

    double *d_data_in;
double *d_sum;

cudaMalloc((void **) &d_data_in,N1*sizeof(double));
cudaMalloc((void **) &d_sum,BLOCKS*sizeof(double));
cudaMemcpy(d_data_in,h_data_in,N1*sizeof(double),cudaMemcpyHostToDevice);
cudaMemcpy(d_sum,h_sum,BLOCKS*sizeof(double),cudaMemcpyHostToDevice);

  	Compute_Sum <<<BLOCKS,N1>>> (d_sum,d_data_in); 
  	cudaDeviceSynchronize();  
    
cudaMemcpy(h_sum, d_sum,BLOCKS*sizeof(double), cudaMemcpyDeviceToHost);
printf("\n Sum of %d elements = %0.15f \n",N1,h_sum[0]);

fclose(f1);
cudaDeviceReset();
return 0;

}

[/code]

data_1681.txt (17.4 KB)

Robert_Crovella · November 18, 2018, 8:53pm

I suggest using proper CUDA error checking (google that), and run your code with cuda-memcheck before asking for help here.

In CUDA it is illegal to have more than 1024 threads per block:

const int N1 = 1681;
...
Compute_Sum <<<BLOCKS,N1>>> (d_sum,d_data_in); 
                      ^^

Topic		Replies	Views
Calculation sum of array parts have large prime number elements CUDA Programming and Performance	5	1851	December 23, 2009
Basic reduction with CUDA CUDA Programming and Performance	1	499	March 22, 2018
Summing array elements using kernel Access frome the whole block grid CUDA Programming and Performance	3	864	July 16, 2010
CUDA - calculation of a sum CUDA Programming and Performance	7	5540	April 30, 2010
Problem with reduction CUDA Programming and Performance	4	2596	March 22, 2011
Reduction an array in to 10 elements by addition of elements based on remaining the indexes to 10 CUDA Setup and Installation	1	481	February 20, 2017
cublasDasum returns incorrect results GPU-Accelerated Libraries	3	991	May 3, 2017
sum columns of a 2 dimensional array with Reduce algorithm CUDA Programming and Performance	0	1158	December 6, 2018
Regarding Vector Reduction To find sum of all components of a vector CUDA Programming and Performance	1	3247	July 22, 2009
Reduction & block dimension Using the easiest reduction example of the SDK CUDA Programming and Performance	6	2230	November 23, 2009

Reduction operation returns incorrect result

Related topics