Is there any alternative to __syncthreads()? Your reply would help a lot.

I have written a program for sum reduction; it computes the sum of an array whose size must be a power of 2. The problem I have is that it only works for up to 2 thread blocks. It won't work for more blocks because I am using __syncthreads().

Is there any alternative to __syncthreads()? __syncthreads() creates a barrier for all the threads in a thread block; I am looking for an alternative that creates a barrier for all the thread blocks in a grid.

Using cudaThreadSynchronize() won't help here. I took the sum reduction program as an example; I have the same problem with other programs as well.

This program works for a maximum of 32 inputs, which means 2 blocks. It won't work on multiple blocks because __syncthreads() is essential here and it only synchronizes within a block. If we use 2 blocks, the computed sum ends up stored in the first block, so there is no issue. If we use 4 blocks, then in the first iteration the computed sums should be stored in the first 2 blocks, but this doesn't happen because we are using __syncthreads(): after the first block executes, the thread indices restart at 0, 1, 2…, so the second block isn't executed in the first iteration. My point is that this program works for 2 blocks only, and I need an alternative to __syncthreads().

Here's the program; you can test it with different inputs and see the exact problem.

[codebox]#include <stdio.h>
#include <cuda.h>

__global__ void redsum(int *a, int n)
{
    for (int strid = n; strid > 1; strid /= 2)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t < (strid / 2))
        {
            a[t] += a[t + (strid / 2)];
            /* The sum pairs the first half of the array with the second
               half; for 16 elements it is a[0]+a[8], a[1]+a[9], ..., a[7]+a[15] */
            //printf("partial sum=%d\n", a[t]); // use emulation mode to see the exact problem
        }
        __syncthreads();
    }
}

int main()
{
    int n, i, *a_h, *a_d, sum = 0;
    printf("Enter array size: ");
    scanf("%d", &n);
    size_t size = sizeof(int) * n;
    a_h = (int *)malloc(size);
    for (i = 0; i < n; i++)
        a_h[i] = i + 1;
    printf("\nThe array is: ");
    for (i = 0; i < n; i++)
        printf("%d ", a_h[i]);
    printf("\n");
    for (i = 0; i < n; i++)
        sum += a_h[i];
    cudaMalloc((void**)&a_d, size);
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    int nb = n / 16 + ((n % 16 == 0) ? 0 : 1);
    redsum<<<nb, 16>>>(a_d, n);
    /* Works for at most 32 inputs (2 blocks): __syncthreads() only
       synchronizes the threads within one block, so with 4 or more
       blocks the partial sums from later blocks are never combined
       correctly in the first iteration. */
    cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);
    printf("\nthe sum computed by GPU is %d ", a_h[0]);
    printf("\nthe sum computed by CPU is %d ", sum);
    free(a_h);
    cudaFree(a_d);
    return 0;
}[/codebox]

Please help…

There is no global barrier function in the CUDA paradigm, and it isn’t necessary for the sort of reduction you are trying to implement.

There is a perfectly good implementation of a reduction in the SDK (and a nicely written paper discussing it) which you can look at for ideas. At worst you will require two passes to achieve the full reduction. The first pass reduces the input data set to an output data set with one entry per block. The second pass can either be performed by launching the reduction kernel again with a single block, or by copying the data back to the host (or using zero-copy memory if your hardware supports it) and performing the second pass on the host.

As avidday mentioned, there is no block-level barrier sync. However, for this particular problem, cudaThreadSynchronize fits perfectly: launch the first-pass kernel, synchronize on the host, then launch the second pass. This method is called parallel reduction and is one of the best-known techniques in GPU programming. A nice presentation on it can be found at: http://developer.download.nvidia.com/compu…c/reduction.pdf

A simple figure (as given in the link above) which nicely summarizes the reduction process has been attached…
red.bmp (947 KB)