I would like to perform an atomic add whose result is clamped (saturated) instead of wrapping around on overflow.
I am adding values to addresses holding uint32_t values.
But I want to do this in such a way that the end result never exceeds 0xffffffff.
Theoretically, I could add to uint64_t values instead, but doing so would halve my bandwidth, so I would like to avoid this.
Are there any special instructions in CUDA that will let me do this, perhaps by controlling the overflow behaviour somehow?
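
To make the intended semantics concrete, here is a minimal sketch of a saturating add emulated with an atomicCAS loop. The helper name atomicAddSaturate, the kernel addKernel, and the launch configuration are hypothetical and chosen only for illustration; this shows the behaviour being asked for, not a built-in CUDA instruction. It assumes uint32_t and unsigned int are the same 32-bit type, as they are on CUDA targets.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA built-in): atomically adds val to *addr,
// clamping the stored result at 0xffffffff instead of wrapping around.
// Returns the value that was stored before the update, like atomicAdd.
__device__ unsigned int atomicAddSaturate(unsigned int* addr, unsigned int val)
{
    unsigned int old = *addr;
    unsigned int assumed;
    do {
        assumed = old;
        unsigned int sum = assumed + val;
        // Unsigned wrap-around shows up as a sum smaller than the operand.
        if (sum < assumed) {
            sum = 0xffffffffu;
        }
        // Store sum only if *addr still holds assumed; otherwise retry.
        old = atomicCAS(addr, assumed, sum);
    } while (old != assumed);
    return old;
}

// Every thread adds the same value to a single counter.
__global__ void addKernel(unsigned int* counter, unsigned int val)
{
    atomicAddSaturate(counter, val);
}

int main()
{
    unsigned int* d_counter = nullptr;
    cudaMalloc(&d_counter, sizeof(unsigned int));
    cudaMemset(d_counter, 0, sizeof(unsigned int));

    // 1024 threads each add 2^24; the true sum is 2^34, which does not fit
    // in 32 bits, so the counter should end up clamped.
    addKernel<<<4, 256>>>(d_counter, 0x01000000u);

    unsigned int h_counter = 0;
    cudaMemcpy(&h_counter, d_counter, sizeof(h_counter), cudaMemcpyDeviceToHost);
    printf("0x%08x\n", h_counter);  // prints 0xffffffff

    cudaFree(d_counter);
    return 0;
}
```

A CAS loop like this retries whenever another thread updates the same address in between, so under contention it is likely slower than a plain atomicAdd. That cost is the reason a dedicated instruction would be preferable.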