Race condition on a simple global array access

I'm a beginner in CUDA, and I'm writing my thesis with it.

I'm trying to implement a recursive algorithm using a static task list.
Basically, there is an input array that contains the data the algorithm works on, and one thread is started for each element of it. There is also an output array where the algorithm puts its results; after each write to the output array, a global offset is incremented atomically. Each thread can produce 0 to 2 results for the output array.
After each iteration, the contents of the output array are copied into the input array (overwriting its previous contents), and a new iteration starts.

I allocate both arrays in the host code with room for 10000 elements and pass their pointers to the kernel as parameters.

I don't know what the policy is on this forum, but my code is pretty long, so I'll summarize the kernel here and upload the major part of the code to pastebin.

My thesis is about the interval Newton method, but don't be scared of the math; I'll try to avoid it while explaining the problem.
Here is the kernel:

/*
Xinf and Xsup are the input arrays
outInf, outSup - the result arrays
outSize - the offset of the output arrays
*/
__global__ void INewton(float* Xinf, float* Xsup, float* outInf, float* outSup, int* outSize)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  //calculations with Xinf[idx] and Xsup[idx]

  if(...){
    outInf[*outSize] = result1inf;
    outSup[*outSize] = result1sup;
    atomicAdd(outSize,1);
  }
  if(...){
    outInf[*outSize] = result2inf;
    outSup[*outSize] = result2sup;
    atomicAdd(outSize,1);
  }
}

After the third iteration there are 2 threads. At the end, outSize is 4 (as it should be), but the output arrays contain only 3 elements each, of which 2 are correct and one is wrong. I debugged the code with Nsight, and both if branches are taken.

I think there is a race condition, with both threads trying to write to the same array location.

I'm looking for a simple, beginner-friendly solution to this.

Here is the full code in pastebin: http://pastebin.com/59iDhPV2

Thank you for your patience; I really can't turn to anyone else with this problem.

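From your short example above, it looks like all of your threads would write to the same locations, outInf[*outSize] and outSup[*outSize], at once, since every thread begins with the same value of outSize. The read of *outSize and the later atomicAdd are separate steps, so another thread can read the same index in between.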

Maybe you could consider using atomics to update outInf and outSup, although I doubt this would be good for performance.
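Another option that may be simpler: atomicAdd returns the value the counter held before the addition, so a thread can use that return value as its own private slot index instead of reading *outSize separately. Below is a minimal sketch of that idea, not your actual kernel. The names emitTwo and runEmitTwo, the dummy results (v and v + 0.5), and the launch configuration are all made up for illustration, and it assumes the output arrays are large enough for every result.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for INewton: every thread emits exactly two results.
__global__ void emitTwo(const float* in, float* outInf, float* outSup,
                        int* outSize, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    float v = in[idx];

    // atomicAdd returns the counter value from before the increment, so the
    // slot reserved here is not handed to any other thread.
    int slot = atomicAdd(outSize, 1);
    outInf[slot] = v;            // placeholder for result1inf
    outSup[slot] = v + 0.5f;     // placeholder for result1sup

    slot = atomicAdd(outSize, 1);
    outInf[slot] = -v;           // placeholder for result2inf
    outSup[slot] = -v + 0.5f;    // placeholder for result2sup
}

// Hypothetical host helper: runs the kernel on inputs 1..n and returns true
// if all 2*n results arrived, each exactly once, with matching inf/sup pairs.
bool runEmitTwo(int n)
{
    std::vector<float> hIn(n);
    for (int i = 0; i < n; ++i) hIn[i] = float(i + 1);

    float *dIn = nullptr, *dInf = nullptr, *dSup = nullptr;
    int* dSize = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dInf, 2 * n * sizeof(float));
    cudaMalloc(&dSup, 2 * n * sizeof(float));
    cudaMalloc(&dSize, sizeof(int));
    cudaMemcpy(dIn, hIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dSize, 0, sizeof(int));   // the counter must start at 0

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    emitTwo<<<blocks, threads>>>(dIn, dInf, dSup, dSize, n);

    // cudaMemcpy waits for the kernel and reports any launch/runtime error.
    int hSize = 0;
    std::vector<float> hInf(2 * n), hSup(2 * n);
    cudaError_t err = cudaMemcpy(&hSize, dSize, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        err = cudaMemcpy(hInf.data(), dInf, 2 * n * sizeof(float),
                         cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        err = cudaMemcpy(hSup.data(), dSup, 2 * n * sizeof(float),
                         cudaMemcpyDeviceToHost);
    cudaFree(dIn); cudaFree(dInf); cudaFree(dSup); cudaFree(dSize);
    if (err != cudaSuccess || hSize != 2 * n) return false;

    // Each of +1..+n and -1..-n must appear exactly once in outInf.
    std::vector<int> seen(2 * n, 0);
    for (int i = 0; i < 2 * n; ++i) {
        float v = hInf[i];
        if (hSup[i] != v + 0.5f) return false;   // pair landed in one slot
        int k = (v > 0.0f) ? int(v) - 1 : n + int(-v) - 1;
        if (k < 0 || k >= 2 * n) return false;
        ++seen[k];
    }
    for (int c : seen) if (c != 1) return false;
    return true;
}
```

Calling runEmitTwo(1000) from main should return true if the slots never collide. Two things to keep in mind with this approach: the order of results in the output arrays is no longer deterministic, and a real kernel would probably want to check slot against the array capacity (10000 in your case) before writing.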