Problem with __syncthreads(): it does not work for more than 64 threads

Hi, I have a Quadro 2000 GPU, and here is a bit of code I’m trying. I get a race condition as soon as the number of threads per block exceeds 64.

__global__ void kernel(int *hostArr, int *tmpArr, int Levels, int index1, int index2)
{
    int tid = threadIdx.x;
    __device__ __shared__ int BufferArr[1030];
    BufferArr[tid] = 1;
    for (int i = 0; i <= Levels; i++)
    {
        __syncthreads();
        tmpArr[index1*i + tid] = BufferArr[tid];
        BufferArr[tid + blockDim.x] = hostArr[index2*i + tid];
        __syncthreads();
        BufferArr[tid] = BufferArr[2*tid] + BufferArr[2*tid + 1];
    }
}

I have been getting correct results with blockDim.x < 64; with anything larger, a race condition occurs.
If anybody has a solution, please suggest something.

Thanks
Ronak

__syncthreads() works as advertised; it is just missing in one place in your code:

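__global__ void kernel(int *hostArr, int *tmpArr, int Levels, int index1, int index2)
{
    int tid = threadIdx.x;
    __device__ __shared__ int BufferArr[1030];
    BufferArr[tid] = 1;
    for (int i = 0; i <= Levels; i++)
    {
        __syncthreads();
        tmpArr[index1*i + tid] = BufferArr[tid];
        BufferArr[tid + blockDim.x] = hostArr[index2*i + tid];
        __syncthreads();
        // Read phase: each thread sums its two inputs into a private variable.
        int temp = BufferArr[2*tid] + BufferArr[2*tid + 1];
        // Wait until every thread in the block has finished reading.
        __syncthreads();
        // Write phase: now it is safe to overwrite BufferArr[tid].
        BufferArr[tid] = temp;
    }
}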

Thanks!! That sorted it out, but I still don’t understand why the error occurred. I mean, each thread writes to a distinct location, and

    __syncthreads();
    BufferArr[tid] = BufferArr[2*tid] + BufferArr[2*tid + 1];
    __syncthreads();

This should work, but it doesn’t.

No, it shouldn’t work: just imagine the code for tid=16.
Then the line translates to BufferArr[16] = BufferArr[32]+BufferArr[33]; right?
But what is thread 32 doing in the meantime? It is not in the same warp, and therefore doesn’t execute the code in lock-step with thread 16. So thread 32 may or may not have already overwritten BufferArr[32] before thread 16 reads it, hence a race condition.
If you want to avoid it, you have to split your line into a read and then a write of BufferArr, separated by a __syncthreads(), as shown by tera.
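
To make the read / barrier / write split concrete, here is a minimal self-contained sketch. It is not the original kernel: it performs a single in-place pairwise-sum step for one block and checks the result on the host. The names pairSumStep and runPairSumStep, the block size of 128 threads, and the test data are all assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 128  // threads per block (assumed; spans four warps)

// One pairwise-sum step done in place in shared memory:
// buf[tid] = buf[2*tid] + buf[2*tid + 1].
// Threads read elements that other threads will overwrite, so the
// read and the write must be separated by a barrier.
__global__ void pairSumStep(const int *in, int *out)
{
    __shared__ int buf[2 * N];
    int tid = threadIdx.x;

    buf[tid]     = in[tid];
    buf[tid + N] = in[tid + N];
    __syncthreads();                             // all loads are visible to the block

    int temp = buf[2 * tid] + buf[2 * tid + 1];  // read phase
    __syncthreads();                             // every thread has finished reading
    buf[tid] = temp;                             // write phase

    // Each thread reads back only the element it wrote itself,
    // so no further barrier is needed before this copy.
    out[tid] = buf[tid];
}

// Hypothetical host harness: returns the number of mismatching outputs.
int runPairSumStep()
{
    int h_in[2 * N], h_out[N];
    for (int i = 0; i < 2 * N; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    pairSumStep<<<1, N>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    int errors = 0;
    for (int i = 0; i < N; ++i)
        if (h_out[i] != 4 * i + 1) ++errors;     // expected (2i) + (2i + 1)

    cudaFree(d_in);
    cudaFree(d_out);
    return errors;
}

int main()
{
    printf("mismatches: %d\n", runPairSumStep());
    return 0;
}
```

Removing the barrier between the read phase and the write phase brings back the race described above: a thread in a later warp may still be reading an element that a thread in an earlier warp has already overwritten. Whether that shows up as wrong output on a given run depends on how the warps happen to be scheduled.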

Thanks for the clarification!!