Using __device__ variable by multiple threads


I am playing around with cuda.

At the moment I have a problem. I am testing a large array for particular responses, and when I get the response, I have to copy the data onto another array.

For example, my test array of 5 elements looks like this:

Result must look like this:

The problem is how do I calculate the address of the second array to store the result? All elements of the first array are checked in parallel.

I am thinking to declare a device variable int addr = 0. Every time I find a response, I will increment the addr. But I am not sure about that because it means that addr may be accessed by multiple threads at the same time. Will that cause problems? Or will the thread wait until another thread finishes using that variable?