Atomic Adding to a Clamped Value

I would like to atomically add, but in such a way that the result is clamped.

I am adding values to addresses holding uint32_t values.

But I want to do this in such a way that the end result never exceeds 0xffffffff.

Theoretically, I could add to uint64_t values instead, but by doing so, I would halve my bandwidth, so I would like to avoid this.

Are there any special instructions inside CUDA that will let me do this? Maybe control the overflow behaviour, somehow?

There aren’t any native clamped atomic add operations.

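You can implement your own “custom” atomic using the atomicCAS loop template in the programming guide, but that is going to have a performance impact, since every update becomes a read plus a compare-and-swap that may have to retry under contention.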
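As a rough sketch of what such a custom atomic might look like, here is a saturating add built on an atomicCAS retry loop. The names atomicAddSaturate and addKernel are made up for illustration, the counter is declared as unsigned int (which is what atomicCAS accepts, and what uint32_t normally maps to on CUDA platforms), and error checking is left out for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA built-in): adds val to *addr, clamping the
// stored result at 0xffffffff. Returns the value seen before the update.
__device__ unsigned int atomicAddSaturate(unsigned int* addr, unsigned int val)
{
    unsigned int old = *addr;
    unsigned int assumed;
    do {
        assumed = old;
        unsigned int sum = assumed + val;
        // Unsigned wraparound shows up as sum < assumed; clamp in that case.
        if (sum < assumed) sum = 0xffffffffu;
        // Store sum only if nobody changed *addr since we read it.
        old = atomicCAS(addr, assumed, sum);
    } while (old != assumed);  // another thread got in first: retry
    return old;
}

// Illustrative kernel: every thread adds 0x10000000 to the same counter.
__global__ void addKernel(unsigned int* counter)
{
    atomicAddSaturate(counter, 0x10000000u);
}

int main()
{
    unsigned int* d_counter = nullptr;
    cudaMalloc(&d_counter, sizeof(unsigned int));
    cudaMemset(d_counter, 0, sizeof(unsigned int));

    // 32 adds of 0x10000000 would wrap a plain 32-bit counter back to 0.
    addKernel<<<1, 32>>>(d_counter);

    unsigned int h_counter = 0;
    cudaMemcpy(&h_counter, d_counter, sizeof(h_counter), cudaMemcpyDeviceToHost);
    printf("counter = 0x%08x\n", h_counter);

    cudaFree(d_counter);
    return 0;
}
```

With many threads hitting the same address, the loop may retry several times per add, which is where the performance cost comes from.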

Perhaps you could use ordinary atomic adds and use an auxiliary bit to indicate overflow. atomicAdd returns the original value, so the thread can recompute the sum itself: if the add caused the result to overflow, i.e., the sum is less than the original value when doing an unsigned comparison, the thread sets the overflow bit in another memory location. The downside is that when you want to use the value, you need to check both the overflow bit and the main variable. The upside is that the atomic addition runs at full speed except for the (presumably rare) case where an overflow occurs and a second memory access is needed to set the overflow bit.
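A minimal sketch of the overflow-bit idea, under some assumptions: one flag word per counter (a packed bitmask would also work), the flag set with atomicOr, and values read back only after all adds have finished, here after the kernel completes. The names atomicAddFlagOverflow, readClamped and addKernel are hypothetical, and error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: ordinary atomicAdd, plus a flag write in the rare
// case where this particular add wrapped the 32-bit counter.
__device__ void atomicAddFlagOverflow(unsigned int* addr, unsigned int val,
                                      unsigned int* overflowFlag)
{
    unsigned int old = atomicAdd(addr, val);  // returns the previous value
    unsigned int sum = old + val;             // what this add produced
    if (sum < old) {                          // unsigned wraparound
        atomicOr(overflowFlag, 1u);
    }
}

// Hypothetical reader: combines the counter and its flag into a clamped value.
__host__ __device__ unsigned int readClamped(unsigned int value,
                                             unsigned int overflowFlag)
{
    return overflowFlag ? 0xffffffffu : value;
}

// Illustrative kernel: every thread adds 0x10000000 to the same counter.
__global__ void addKernel(unsigned int* counter, unsigned int* overflowFlag)
{
    atomicAddFlagOverflow(counter, 0x10000000u, overflowFlag);
}

int main()
{
    unsigned int* d_counter = nullptr;
    unsigned int* d_flag = nullptr;
    cudaMalloc(&d_counter, sizeof(unsigned int));
    cudaMalloc(&d_flag, sizeof(unsigned int));
    cudaMemset(d_counter, 0, sizeof(unsigned int));
    cudaMemset(d_flag, 0, sizeof(unsigned int));

    addKernel<<<1, 32>>>(d_counter, d_flag);

    unsigned int h_counter = 0, h_flag = 0;
    cudaMemcpy(&h_counter, d_counter, sizeof(h_counter), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_flag, d_flag, sizeof(h_flag), cudaMemcpyDeviceToHost);

    // The raw counter may have wrapped; the flag tells us to treat it as clamped.
    printf("raw = 0x%08x, flag = %u, clamped = 0x%08x\n",
           h_counter, h_flag, readClamped(h_counter, h_flag));

    cudaFree(d_counter);
    cudaFree(d_flag);
    return 0;
}
```

Since each added value is below 2^32, a single add can wrap the counter at most once, and the thread whose add crossed the boundary is the one that sees sum < old. Reading the counter while adds are still in flight would not give a reliable clamped value, because the flag write trails the add.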