I have written a sum-reduction program that computes the sum of an array. The array size must be a power of 2. The problem is that it only works with up to 2 thread blocks. It won't work with more blocks because I am relying on __syncthreads().
Is there an alternative to __syncthreads()? __syncthreads() creates a barrier for all the threads in one thread block. I am looking for an alternative that creates a barrier across all the thread blocks in a grid.
Using cudaThreadSynchronize() won't help here. I took sum reduction as an example; I have the same problem with other programs as well.
This program works for at most 32 inputs, which means 2 blocks.
It won't work with more blocks because __syncthreads() is essential here, and it
seems to restrict execution to the first block only. If we use 2 blocks, the computed sums
are stored in the first block, so there is no issue. If we use 4 blocks, then in the first
iteration the computed sums should be stored in the first 2 blocks, but this doesn't happen because
we are using __syncthreads(). After executing the first block, it seems to reinitialize the threads to 0, 1, 2…, so the second block isn't executed in the first iteration. My point is that this program works for 2 blocks only, and I need an alternative to __syncthreads().
Here's the program. You can test it with different inputs to see the exact problem.
[codebox]#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

__global__ void redsum(int *a, int n)
{
    for (int strid = n; strid > 1; strid /= 2)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t < (strid / 2))
        {
            a[t] += a[t + (strid / 2)];
            /* Each element of the first half is added to its partner in the
               second half; for 16 array elements that is a[0]+a[8],
               a[1]+a[9] ... a[7]+a[15] */
            //printf("partial sum=%d\n", a[t]); //use emulation mode to see exact problem
        }
        __syncthreads();
    }
}
int main()
{
    int n, i, *a_h, *a_d, sum = 0;
    printf("Enter array size: ");
    scanf("%d", &n);
    size_t size = sizeof(int) * n;
    a_h = (int *)malloc(size);
    for (i = 0; i < n; i++)
        a_h[i] = i + 1;
    printf("\nThe array is: ");
    for (i = 0; i < n; i++)
        printf("%d ", a_h[i]);
    printf("\n");
    for (i = 0; i < n; i++)
        sum += a_h[i];
    cudaMalloc((void**)&a_d, size);
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    int nb = n / 16 + ((n % 16 == 0) ? 0 : 1);
    redsum<<<nb, 16>>>(a_d, n);
    /* This program works for at most 32 inputs, which means 2 blocks.
       It won't work with more blocks because __syncthreads() is essential and it
       seems to restrict execution to the first block only. With 2 blocks the computed
       sums are stored in the first block, so there is no issue. With 4 blocks, the first
       iteration should store its sums in the first 2 blocks, but this doesn't happen because
       we are using __syncthreads(). It seems to reinitialize the threads to 0, 1, 2…, so the
       second block isn't executed in the first iteration.
       My point is that this program works for 2 blocks only. */
    cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);
    printf("\nthe sum computed by GPU is %d ", a_h[0]);
    printf("\nthe sum computed by CPU is %d ", sum);
    free(a_h);
    cudaFree(a_d);
    return 0;
}[/codebox]
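For comparison, the end of a kernel launch is one point where all blocks of a grid are guaranteed to have finished, so a barrier across blocks can be approximated by moving the stride loop to the host and launching one kernel per reduction step. The following is only a minimal sketch of that idea, not an in-kernel replacement for __syncthreads(). The kernel name `redsum_step`, the fixed `n = 64`, and the block size of 16 are illustrative assumptions that are not part of the program above.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// One reduction step: each of the first `half` elements absorbs its partner
// from the second half. Within a single step every thread writes a[t] with
// t < half and reads a[t + half] with index >= half, so no in-kernel
// barrier is needed.
__global__ void redsum_step(int *a, int half)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < half)
        a[t] += a[t + half];
}

int main(void)
{
    const int n = 64;        // assumption: power of 2; 64 would need 4 blocks of 16
    const int threads = 16;  // same block size as the original program
    size_t size = sizeof(int) * n;
    int *a_h = (int *)malloc(size);
    int *a_d = NULL;
    int sum = 0;

    for (int i = 0; i < n; i++) {
        a_h[i] = i + 1;
        sum += a_h[i];       // CPU reference sum
    }

    cudaMalloc((void **)&a_d, size);
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

    // Launches issued to the same stream execute in order, so each launch
    // boundary acts as a barrier across all blocks of the previous step.
    for (int half = n / 2; half >= 1; half /= 2) {
        int nb = (half + threads - 1) / threads;  // blocks needed for this step
        redsum_step<<<nb, threads>>>(a_d, half);
    }

    // Only a[0] holds the final result; this copy also waits for the kernels.
    cudaMemcpy(a_h, a_d, sizeof(int), cudaMemcpyDeviceToHost);
    printf("the sum computed by GPU is %d\n", a_h[0]);
    printf("the sum computed by CPU is %d\n", sum);

    free(a_h);
    cudaFree(a_d);
    return 0;
}
```

The grid shrinks with every launch because each step only needs the first `half` threads. The trade-off is one kernel launch per step, so log2(n) launches in total.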
Please help…