Hi, I have a Quadro 2000 GPU. Here is a bit of code I'm trying, but I get a race condition when the number of threads exceeds 64:
__global__ void kernel(int *hostArr, int *tmpArr, int Levels, int index1, int index2)
{
    int tid = threadIdx.x;
    __shared__ int BufferArr[1030];
    BufferArr[tid] = 1;
    for (int i = 0; i <= Levels; i++)
    {
        __syncthreads();
        tmpArr[index1 * i + tid] = BufferArr[tid];
        BufferArr[tid + blockDim.x] = hostArr[index2 * i + tid];
        __syncthreads();
        BufferArr[tid] = BufferArr[2 * tid] + BufferArr[2 * tid + 1];
    }
}
I get correct results with blockDim.x < 64; with more threads than that, a race condition occurs.
If anybody has a solution, please suggest something.
No, it shouldn't work: just imagine the code for tid = 16.
The last line of the loop then becomes BufferArr[16] = BufferArr[32] + BufferArr[33]; right?
But what is the thread with tid = 32 doing in the meantime? That thread is not in the same warp, and therefore does not execute the code in lock-step with thread 16. So thread 32 may or may not have already updated BufferArr[32] before thread 16 reads it; hence a race condition.
If you want to avoid it, you have to split that line into a read of BufferArr, then a write of BufferArr, separated by a __syncthreads(), as shown by tera.
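As a sketch, the split inside the loop body could look like this (same names as the kernel above; the surrounding loop and the tmpArr/hostArr copies stay unchanged):

```cuda
__syncthreads();
// Read phase: every thread reads both source elements into a register
// before any thread is allowed to overwrite shared memory.
int sum = BufferArr[2 * tid] + BufferArr[2 * tid + 1];
__syncthreads();  // all reads complete before any write begins
// Write phase: only now is BufferArr[tid] overwritten.
BufferArr[tid] = sum;
```

The extra barrier between the read and the write guarantees that no thread's read of BufferArr[2*tid] or BufferArr[2*tid+1] can observe another thread's write from the same iteration, regardless of block size or warp scheduling.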