Hi, I have a Quadro 2000 GPU. Here is a bit of code I'm trying, but I get a race condition whenever the number of threads exceeds 64:
__global__ void kernel(int *hostArr, int *tmpArr, int Levels, int index1, int index2)
{
    int tid = threadIdx.x;
    __shared__ int BufferArr[...];   // array size was not shown in the post

    for (int i = 0; i <= Levels; i++)
    ...
I have been getting correct results with blockDim.x < 64; with anything larger, the race condition appears.
If anybody has a solution, please suggest something.
No, it shouldn't work: just imagine the code for tid = 16.
For that thread the update line translates to BufferArr[16] = BufferArr[16] + BufferArr[32];, right?
But what is thread 32 doing in the meantime? That thread is not in the same warp, and therefore does not execute in lock-step with thread 16. So thread 32 may or may not have already updated its value before thread 16 reads it, hence the race condition.
If you want to avoid it, you have to split that line into a read of BufferArr and then a write to it, separated by a __syncthreads(), as shown by tera.
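A minimal sketch of that read/sync/write pattern (the buffer size, the stride expression, and the kernel name are illustrative assumptions, not from the original post):

```cuda
__global__ void kernel_fixed(int *data, int Levels)
{
    int tid = threadIdx.x;
    __shared__ int BufferArr[128];           // size is an assumption

    BufferArr[tid] = data[tid];
    __syncthreads();

    for (int i = 0; i <= Levels; i++)
    {
        int stride = 1 << i;                 // illustrative stride per level

        // Read phase: every thread reads into a register first.
        int val = BufferArr[tid];
        if (tid + stride < blockDim.x)
            val += BufferArr[tid + stride];
        __syncthreads();                     // all reads done before any write

        // Write phase: now it is safe to overwrite the shared array.
        BufferArr[tid] = val;
        __syncthreads();                     // all writes done before next read
    }

    data[tid] = BufferArr[tid];
}
```

Note that both __syncthreads() calls sit outside the if, so every thread in the block reaches them; calling __syncthreads() inside a divergent branch is itself undefined behavior.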