Need help in debugging parallel sum reduction program

:wave: Hi, could you help me in debugging this program, this is about parallel sum reduction. if i take a array of 10 numbers, i will compute the sum of these numbers in parallel. heres the logic

1 2 3 4 5 6 7 8

(1+5) (2+6) (3+7) (4+8)

6 8 10 12

(6+10) (8+12)

16 20

(16+20)

36 — this is the answer

following is the code

#include <stdio.h>
#include <cuda.h>
#include <math.h>

global void redsum(int *a, int n)
{
for(int strid=n; strid>1 ; strid=strid/2) // used to divide size of array in each iteration
{
int t = blockIdx.x * blockDim.x + threadIdx.x;
if(t<strid/2) //makes sure to find sum upto middle of array in each iteration
{
a[t]+=a[t+(strid/2)]; //adding first half of array elements to second half of array elemets

   // printf(" %d", a[t]);    this line could be used in emulation mode
  
  }

  __syncthreads();

}

}

main()
{
int n,i,*a_h,*a_d;
printf(“Enter array size: “);
scanf(”%d”,&n);

size_t size = sizeof(int)*n;

a_h=(int *)malloc(size);

for(i=0;i<n;i++)

a_h[i]=i+1;               // initializing array  1,2,3,4...

printf("\nThe array is: “);
for(i=0;i<n;i++)
printf(”%d “,a_h[i]);
printf(”\n");

cudaMalloc((void**)&a_d,size);

cudaMemcpy(a_d,a_h,size,cudaMemcpyHostToDevice);

int nb=n/4 + ((n%4==0)?0:1); // i simply took 4 threads in a block, no reason

redsum<<<nb,4>>>(a_d,n);

cudaMemcpy(a_h,a_d,size,cudaMemcpyDeviceToHost);

printf("\n the sum is: %d", a_h[0]); //final sum will be in a[0]

return 0;

}

This program works correctly if u give input as 8. try giving odd numbers like 5 or 9, it doesnt give correct sum for odd numbers. Please help me in debugging this program.

Regards
Sharath

__global__ void redsum(int *a, int n)

{

  for(int strid=n; strid>1; strid=strid/2) // used to divide size of array in each iteration

  {

	int t = blockIdx.x * blockDim.x + threadIdx.x;

	if(t<strid/2) //makes sure to find sum upto middle of array in each iteration

	{

	  a[t]+=a[t+(strid/2)]; //adding first half of array elements to second half of array elemets

	}

  __syncthreads();

  }

}

Have you actually tried execuing this code for small n (say: 3) using paper and pencil?

For n=3, strid=3 and strid/2=1 (integer rounding down). So your for loop will iterate only once and your if condition will be taken by 1 thread only. All in all you will sum up only two numbers.

In current state I believe your code would work only for n being a power of 2.

Exactly, thats the problem i have, how can i make this program work for all numbers, should i forget this kernel and write a newer one?

You are right, it works only 2 power number of inputs, i will write program with different method, thanks for the help

I think it would just suffice to replace all occurences of “strid/2” with “(strid+1)/2” with an exception of “if” statement which should remain as it is.