Hi, could you help me in debugging this program, this is about parallel sum reduction. if i take a array of 10 numbers, i will compute the sum of these numbers in parallel. heres the logic

1 2 3 4 5 6 7 8

(1+5) (2+6) (3+7) (4+8)

6 8 10 12

(6+10) (8+12)

16 20

(16+20)

36 — this is the answer

following is the code

#include <stdio.h>

#include <cuda.h>

#include <math.h>

**global** void redsum(int *a, int n)

{

for(int strid=n; strid>1 ; strid=strid/2) // used to divide size of array in each iteration

{

int t = blockIdx.x * blockDim.x + threadIdx.x;

if(t<strid/2) //makes sure to find sum upto middle of array in each iteration

{

a[t]+=a[t+(strid/2)]; //adding first half of array elements to second half of array elemets

```
// printf(" %d", a[t]); this line could be used in emulation mode
}
__syncthreads();
```

}

}

main()

{

int n,i,*a_h,*a_d;

printf(“Enter array size: “);

scanf(”%d”,&n);

size_t size = sizeof(int)*n;

a_h=(int *)malloc(size);

for(i=0;i<n;i++)

```
a_h[i]=i+1; // initializing array 1,2,3,4...
```

printf("\nThe array is: “);

for(i=0;i<n;i++)

printf(”%d “,a_h[i]);

printf(”\n");

cudaMalloc((void**)&a_d,size);

cudaMemcpy(a_d,a_h,size,cudaMemcpyHostToDevice);

int nb=n/4 + ((n%4==0)?0:1); // i simply took 4 threads in a block, no reason

redsum<<<nb,4>>>(a_d,n);

cudaMemcpy(a_h,a_d,size,cudaMemcpyDeviceToHost);

printf("\n the sum is: %d", a_h[0]); //final sum will be in a[0]

return 0;

}

This program works correctly if u give input as 8. try giving odd numbers like 5 or 9, it doesnt give correct sum for odd numbers. Please help me in debugging this program.

Regards

Sharath