I am trying to make decisions about algorithms for some parallel bit operations that I want to do. I am performing a set of bit operations on a bitvector, and I have the choice of iterating over the source or the target bit array. Iterating over the target array makes it easy to ensure that the loop is independent, since each iteration will write to a single element in the bit array without interference from the other loops. This could lead to unfavorable memory access patterns in the reads, however.
I was thinking about comparing the performance between the source and target iteration, except that it is not clear that source-centric iteration is necessarily correct in this case, depending on the semantics of OpenACC, which are not clear to me in this case.
Basically, for a given source element A, representing a chunk of a bit vector, we will store that array into potentially two target chunks, C and D. The next source element B will do the same, but may write to a different segment of C or D. Thus, there is a write interference at the byte level, but not at the bit level. Assuming that each assignment to C and D is atomic, then there is no interference, as the two operations performed on C/D will be independent of one another at the bit level, despite both writing to C/D. However, it is not clear to me that OpenACC guarantees these operations will be atomic, and thus, it seems possible to me that there could be a race condition on these operations.
So, if I have two different iterations that are writing to the same byte, but one only changes some of the bits, while the other changes the other bits, but neither interferes at the bit level, is this safe to do, or will I get a race condition, thus necessitating that I handle each target byte independently from the others (target-centric iterations instead of the source-centric iteration)?