Global memory array versus very large bit array

I need to store a relatively large amount of information in global memory, essentially a large array of boolean values. This would take up almost all of the global memory on the GPU. I would frequently need to check whether a specific bit is “set”, but would not need to set bits very often. The access pattern would often be sequential: check whether index 0 is set, then index 1, then index 2, and so on.

Would it be better to implement this as (1) a large bit array (presumably by allocating one very large unsigned char array), (2) an array of a small primitive type (using only 1/0 values), or (3) my own enum type (True/False)?

Also, if approach (1) is used, is there any difference between how a bit-test operation should be implemented on a GPU versus in plain C?

Unfortunately I only have compute capability 1.1, but suggestions for 2.0-specific code would still be very interesting.

Thanks

Is there a possibility of race conditions between different threads reading from and writing to the array? A disadvantage of a bitmask is the possibility of collisions between threads trying to update different bits in the same word. I think it is possible to solve this with a loop and atomicCAS to do the update, but it could be tricky.
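Something like this untested sketch (assuming the bits are packed into unsigned int words) is what I mean by the atomicCAS loop:

    // Untested sketch: atomically set one bit in a bit array packed into
    // unsigned int words. atomicCAS() on 32-bit words in global memory is
    // available starting with compute capability 1.1.
    __device__ void setBitAtomic(unsigned int *bits, unsigned int bit)
    {
        unsigned int *word = &bits[bit / 32u];   // word containing the target bit
        unsigned int mask  = 1u << (bit % 32u);  // mask selecting the bit
        unsigned int old = *word, assumed;
        do {
            assumed = old;
            // atomicCAS() only writes if no other thread changed the word in
            // the meantime; otherwise it returns the current value and we retry.
            old = atomicCAS(word, assumed, assumed | mask);
        } while (old != assumed);
    }

(If I remember right, atomicOr() would also do this in a single call on hardware that supports it.)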

Unfortunately, to make use of the built-in atomic operations (which first appeared in compute capability 1.1), you would need to use an unsigned integer array, wasting a very large amount of memory bandwidth.

If you go with the bit array approach, you should take a look at the __popc() function, which reports how many bits are set in a 32-bit word. On compute capability 2.0, it compiles to a single instruction, and could be a very useful shortcut if you expect your bit array to be sparsely populated.
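For example (untested), counting the set bits in a packed array:

    // Untested: total population count of a bit array packed into 32-bit words.
    __device__ unsigned int countSetBits(const unsigned int *bits,
                                         unsigned int numWords)
    {
        unsigned int total = 0;
        for (unsigned int i = 0; i < numWords; ++i)
            total += __popc(bits[i]);  // one POPC instruction per word on 2.0
        return total;
    }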

Thank you very much for your help. I think I can rework some things so that no two threads will cause a race condition, so I may go with this approach.

If this array does not need to be accessed outside the block (and does not need to be transferred back to the host either), would it be better to simply allocate it in “local” memory rather than in “global” memory prior to kernel execution? (It is too big for shared memory.)

Also, to actually implement a bit array in CUDA, would it be best practice to simply allocate a char* of the appropriate size and perform the standard bit set/get operations on it?
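Something like the standard C style is what I have in mind (getBit/setBit below are just names I made up):

    // Sketch: plain C-style bit get/set on an unsigned char array.
    // Note: setBit is not atomic; this relies on avoiding the race
    // conditions discussed above.
    __device__ int getBit(const unsigned char *bits, unsigned int i)
    {
        return (bits[i >> 3] >> (i & 7u)) & 1;
    }

    __device__ void setBit(unsigned char *bits, unsigned int i)
    {
        bits[i >> 3] |= (unsigned char)(1u << (i & 7u));
    }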

Thank you again so much!

So, just to be clear, each thread in a block gets its own private bit array? In that case, local memory would also be fine (it gets stored in global memory anyway), and might improve coalescing as well. I don’t think local memory can be dynamically allocated, so you would have to specify the size of these bit arrays at compile time.
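A compile-time-sized per-thread array would look something like this (a sketch; BITS_PER_THREAD is a made-up constant):

    #define BITS_PER_THREAD 4096                     // made-up, compile-time size
    #define WORDS_PER_THREAD (BITS_PER_THREAD / 32)

    __global__ void myKernel(void)
    {
        // Per-thread array; too large for registers, so the compiler places
        // it in local memory, which is physically backed by global memory.
        unsigned int bits[WORDS_PER_THREAD];

        for (int i = 0; i < WORDS_PER_THREAD; ++i)
            bits[i] = 0u;

        // ... per-thread bit set/get operations on bits[] ...
    }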

Either unsigned char or unsigned int is probably best. (Unsigned, because the behavior of right shifts on signed integers is implementation-defined in C. Better to avoid confusion later…) The only advantage of unsigned int would be the ability to load and manipulate more bits at a time, for example if __popc() were useful. The architecture is, in general, most efficient at working with 32-bit words.
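For the sequential access pattern you described, that could look like this untested sketch: load one word into a register and test 32 bits without touching memory again:

    // Untested: sequential scan of a packed bit array, one global load per
    // 32 bits; empty words are skipped entirely.
    __device__ void scanBits(const unsigned int *bits, unsigned int numBits)
    {
        unsigned int numWords = (numBits + 31u) / 32u;
        for (unsigned int w = 0; w < numWords; ++w) {
            unsigned int word = bits[w];  // one load covers 32 consecutive bits
            if (word == 0u)
                continue;                 // fast path for a sparse array
            for (unsigned int b = 0; b < 32u; ++b) {
                if (word & (1u << b)) {
                    unsigned int index = w * 32u + b;
                    // ... bit 'index' is set; handle it here ...
                }
            }
        }
    }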

However, if you need random access to bits in the array, then unsigned char might be more efficient, although that should be benchmarked. My hesitation is due to the fact that reads and writes will go through the L1 cache, which has a 128-byte line size, so reading a byte or a 32-bit word will cause the same I/O traffic if the memory location isn’t already in the cache.