[GPU algorithms] writing in parallel to a bitset

Here’s a GPU algorithms question relevant to a problem I’m trying to speed up:

Let’s suppose I conceptually have a data field like the following, where 512 is the number of threads in the warp:

bool is_a_foo[131072][512]

The bools in this structure represent whether data elsewhere (which happens to have similar dimensions… but that’s irrelevant) is a foo. For simplicity, let’s assume I’m just running on one GPU block, with each thread ripping through (in lock step via a __syncwarp()… but please don’t let that be too distracting, as in practice I’m doing something more sensical) locations 0->131071. In other words, each thread’s code looks something like the following:

// assume is_a_foo is initialized earlier to 0's by some sort of memset call
for (int i = 0; i < 131072; ++i) {
    if (something_kind_of_expensive_but_not_the_bottleneck()) {
        is_a_foo[ i ][thread] = true;
    }
}

With each bool represented as 8 bits, no data is lost. However, let’s suppose that I’d like to tighten up the memory/cache footprint and bandwidth consumption. We could instead represent the above data structure as:

unsigned int is_a_foo[131072][512 / (sizeof(unsigned int) * 8)];

And we can perform bit arithmetic to set the particular bit of interest to 1.

The problem is that without any special handling, the writes to is_a_foo will smash each other, and not every field that should be set to a 1 will necessarily be set to a 1.

In the case that we’re willing to do something special, we can use atomicCAS to ensure that no writes are lost. Unfortunately, this seems kind of expensive. Indeed, in my application, where a kernel launch takes about 30 milliseconds, the kernel execution time increases by approximately 50%. (It’s currently unclear whether the additional time is due to the atomic op or the extra instructions, but I suspect it’s the atomic op.)

One thing that would mitigate the damage is if I were able to operate on unsigned chars instead of unsigned ints. Unfortunately, CUDA provides no such interface. And, when I operate on unsigned shorts, I get a compiler error about the function not being available for unsigned shorts (details available upon request).

All this is to ask, are there any algorithms/data structures that are a good fit for this type of operation on a GPU?

One idea is to use atomicAdd instead of atomicCAS, which could at least theoretically, be cheaper to implement in the hardware.

This would work in this situation, because only one thread will ever be writing to one bit in the array. Therefore, before adding 1 in the appropriate slot, I can test to make sure the bit of interest is 0 before performing the addition. If the bit is already 1, then I’ve simply set that bit earlier, and there’s no need to try to set the bit to 1 yet again (which would result in an overflow in this case).

I should have stated that in my application, bits will only ever go from 0->1 but never from 1->0, and that bits can be set more than once (it’s as if there’s an outer for loop that’s not shown for the sake of presenting a “simple” question).

The cross posting has a sensible answer, IMO:

https://stackoverflow.com/questions/55192870/writing-in-parallel-to-a-bitset-on-a-gpu

You may wish to study the various warp collective instrinsics. They are documented in the programming guide.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-vote-functions

Thanks @Robert_Crovella. The __ballot functions may do the trick. Will report back with how it goes.