Hi, I have a Quadro 2000 GPU. Here is a bit of code I'm trying, but I get a race condition when the number of threads exceeds 64:
__global__ void kernel(int *hostArr, int *tmpArr, int Levels, int index1, int index2)
{
    int tid = threadIdx.x;
    __shared__ int BufferArr[1030];
    BufferArr[tid] = 1;
    for (int i = 0; i <= Levels; i++)
    {
        __syncthreads();
        tmpArr[index1 * i + tid] = BufferArr[tid];
        BufferArr[tid + blockDim.x] = hostArr[index2 * i + tid];
        __syncthreads();
        BufferArr[tid] = BufferArr[2 * tid] + BufferArr[2 * tid + 1];
    }
}
I get correct results with blockDim.x < 64; with more threads than that, a race condition occurs.
If anybody has a solution, please suggest something.
No, it shouldn't work: just imagine the code for tid = 16.
The last line of the loop then becomes BufferArr[16] = BufferArr[32] + BufferArr[33]; right?
But what is the thread with tid = 32 doing in the meantime? That thread is not in the same warp, and therefore does not execute the code in lock-step with thread 16. So thread 32 may or may not have already updated BufferArr[32] before thread 16 reads it; hence a race condition.
If you want to avoid it, you have to split that line into a read of BufferArr, then a write of BufferArr, separated by a __syncthreads(), as shown by tera.
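As a sketch, the split inside the loop body could look like this (same names as the kernel above; the surrounding loop and the tmpArr/hostArr copies stay unchanged):

```cuda
__syncthreads();
// Read phase: every thread reads both source elements into a register
// before any thread is allowed to overwrite shared memory.
int sum = BufferArr[2 * tid] + BufferArr[2 * tid + 1];
__syncthreads();  // all reads complete before any write begins
// Write phase: only now is BufferArr[tid] overwritten.
BufferArr[tid] = sum;
```

The extra barrier between the read and the write guarantees that no thread's read of BufferArr[2*tid] or BufferArr[2*tid+1] can observe another thread's write from the same iteration, regardless of block size or warp scheduling.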