I am pretty new to CUDA programming, and I have the following situation. I have a boolean array in global memory, initialized to 0, and multiple threads running in parallel that each either write a 1 to an element or do nothing. The problem is that several threads may end up writing to the same element of the array. I don't know much about how this works at a low level, but I assume that multiple writes to the same memory location will be done serially, so I will lose some parallelism.
I was wondering whether the hardware is smart enough to take care of this, or whether it will have an impact on efficiency. Is it possible to tell the hardware to allow only one write per array element and discard the rest? I don't need 5 threads to write a 1 to the same element.
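
For concreteness, here is a minimal sketch of the write pattern described above. The names `mark_flags` and `run_mark_flags` are placeholders for illustration only, as is the rule that even-numbered threads write to element `i % num_flags`. It assumes `n > 0` and `num_flags > 0`, and it is meant only to show many threads storing the same value to the same element, not to suggest a solution.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread either writes a 1 to one element or does
// nothing. With n much larger than num_flags, many threads target the same
// element and all of them store the same value.
__global__ void mark_flags(bool *flags, int num_flags, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && i % 2 == 0) {          // placeholder condition (assumption)
        flags[i % num_flags] = true;    // several threads may hit this element
    }
}

// Hypothetical host helper: launches the kernel and returns how many flags
// ended up set, or -1 if a CUDA call failed.
int run_mark_flags(int n, int num_flags)
{
    bool *d_flags = nullptr;
    if (cudaMalloc(&d_flags, num_flags * sizeof(bool)) != cudaSuccess) {
        return -1;
    }
    // Initialize the whole array to 0 (false).
    if (cudaMemset(d_flags, 0, num_flags * sizeof(bool)) != cudaSuccess) {
        cudaFree(d_flags);
        return -1;
    }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    mark_flags<<<blocks, threads>>>(d_flags, num_flags, n);

    // cudaMemcpy waits for the kernel to finish before copying.
    bool *h_flags = new bool[num_flags];
    cudaError_t err = cudaMemcpy(h_flags, d_flags, num_flags * sizeof(bool),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_flags);

    int count = -1;
    if (err == cudaSuccess) {
        count = 0;
        for (int k = 0; k < num_flags; ++k) {
            if (h_flags[k]) ++count;
        }
    }
    delete[] h_flags;
    return count;
}
```

With something like `run_mark_flags(1024, 8)`, 512 threads write to only 4 distinct elements, which is the kind of contention I am asking about.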