Nuances of Independence for bit operations

I am trying to choose an algorithm for some parallel bit operations. I am performing a set of bit operations on a bit vector, and I have the choice of iterating over either the source or the target bit array. Iterating over the target array makes it easy to ensure that the loop is independent, since each iteration writes to a single element of the bit array without interference from the other iterations. However, this can lead to unfavorable memory access patterns in the reads.
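To make the target-centric option concrete, here is a sketch using a hypothetical cross-word left shift as the bit operation (not necessarily the actual operation in question):

```c
#include <stdint.h>

/* Target-centric iteration: each iteration computes one target word from
   the (up to) two source words that overlap it, so every iteration writes
   exactly one element and the loop is trivially independent.
   Hypothetical operation: shift the whole bit vector left by `shift` bits
   (0 < shift < 64); `nwords` is the number of 64-bit words. */
void shift_left_target_centric(const uint64_t *src, uint64_t *dst,
                               int nwords, int shift)
{
    #pragma acc parallel loop
    for (int i = 0; i < nwords; ++i) {
        uint64_t lo = src[i] << shift;                      /* bits staying in word i */
        uint64_t hi = (i > 0) ? src[i - 1] >> (64 - shift)  /* bits carried in from word i-1 */
                              : 0;
        dst[i] = lo | hi;   /* exactly one write per iteration */
    }
}
```

Each word is read twice (as `src[i]` and as `src[i - 1]` of the next iteration), which is the unfavorable read pattern mentioned above.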

I was planning to compare the performance of the source-centric and target-centric iterations, except that it is not clear that source-centric iteration is even correct here; that depends on OpenACC semantics that I do not fully understand.

Basically, a given source element A, representing a chunk of a bit vector, will be stored into potentially two target chunks, C and D. The next source element B will do the same, but may write to a different segment of C or D. Thus there is write interference at the byte level, but not at the bit level. If each assignment to C and D were atomic, there would be no interference, as the two operations performed on C/D would be independent of one another at the bit level, despite both writing to C/D. However, it is not clear to me that OpenACC guarantees these operations will be atomic, so it seems possible that there could be a race condition.
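A sketch of the source-centric variant of the same hypothetical shift shows where the concern arises: run sequentially it produces the same result, but iterations i and i+1 both update word i+1.

```c
#include <stdint.h>
#include <string.h>

/* Source-centric iteration: iteration i scatters source word i into target
   words i and i+1 with OR. The iterations touch disjoint *bits*, but each
   `|=` is a read-modify-write of a whole 64-bit word, and iterations i and
   i+1 both update dst[i+1] -- the potential race in question. */
void shift_left_source_centric(const uint64_t *src, uint64_t *dst,
                               int nwords, int shift)
{
    memset(dst, 0, (size_t)nwords * sizeof *dst);  /* OR targets must start at zero */
    #pragma acc parallel loop
    for (int i = 0; i < nwords; ++i) {
        dst[i] |= src[i] << shift;                 /* low bits stay in word i */
        if (i + 1 < nwords)
            dst[i + 1] |= src[i] >> (64 - shift);  /* high bits carry into word i+1 */
    }
}
```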

So, if two different iterations write to the same byte, but each changes a disjoint subset of its bits, is this safe, or will I get a race condition, necessitating that I handle each target byte independently from the others (target-centric iteration instead of source-centric iteration)?

Hi Aaron,

This really isn’t an OpenACC issue; rather, it has to do with the smallest addressable memory unit of the architecture (CPU or GPU). So yes, while each thread changes only a single bit, each must read and write at least a byte, thus creating a dependency.
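For completeness: OpenACC does have an `atomic` directive whose `update` form covers `|=`, which would make the conflicting read-modify-writes safe, though at the cost of serializing them rather than removing the overhead. A sketch, assuming a simple bit-set operation over a packed vector:

```c
#include <stdint.h>

/* Sketch: set bit pos[i] of a packed bit vector for each i. Different i
   may land in the same 64-bit word, so the OR is guarded with an OpenACC
   atomic update. This removes the race but serializes conflicting
   updates, so it is not overhead-free. On a plain C compiler the pragmas
   are ignored and the loop runs sequentially. */
void set_bits_atomic(uint64_t *words, const int *pos, int n)
{
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i) {
        #pragma acc atomic update
        words[pos[i] / 64] |= (uint64_t)1 << (pos[i] % 64);
    }
}
```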

How big are the bit vectors? Could you use an array of “char” or “integer(1)” instead? You’d increase your memory usage but allow parallel access to the array.
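That is, something along these lines (sketch): each iteration then owns one whole addressable element, so there is no read-modify-write overlap between threads.

```c
/* Sketch of the byte-per-bit layout: flags[k] holds 0 or 1 instead of
   bit k of a packed word. Each iteration writes exactly one addressable
   byte, so no two iterations share a memory location. */
void set_flags(unsigned char *flags, const int *pos, int n)
{
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i)
        flags[pos[i]] = 1;   /* whole-byte store, no read-modify-write */
}
```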

  • Mat

Thanks for the information. I was hoping there would be some way to make the operations atomic without overhead, but it appears that I have to organize my loop by writes rather than reads.

It’s not possible in this case for me to use just bytes instead of bits, as the compact representation is important.