I would like to perform an atomic add whose result is clamped to a maximum value instead of wrapping around on overflow.
I am adding values to addresses holding
But I want to do this in such a way that the end result never exceeds
In theory, I could accumulate into uint64_t values instead, but that would halve my bandwidth, so I would like to avoid it.
Does CUDA provide any special instructions or intrinsics that would let me do this, perhaps by controlling the overflow behaviour somehow?
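To make the desired semantics concrete, here is a minimal sketch of a clamped atomic add built on an `atomicCAS` retry loop. It is only an illustration under stated assumptions: the element type is taken to be a 32-bit `unsigned int`, the limit is taken to be `0xFFFFFFFF`, and the names `atomicAddClamped`, `accumulate`, and `runDemo` are hypothetical, not CUDA built-ins. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA built-in): atomically adds `val` to `*addr`,
// clamping the stored result to `limit`. Returns the previous value, like atomicAdd.
__device__ unsigned int atomicAddClamped(unsigned int* addr,
                                         unsigned int val,
                                         unsigned int limit)
{
    unsigned int old = *addr;  // initial guess; validated by atomicCAS below
    unsigned int assumed;
    do {
        assumed = old;
        // Headroom left before reaching the limit (0 if already at or above it).
        unsigned int room = (assumed < limit) ? (limit - assumed) : 0u;
        // Saturating sum, computed without ever overflowing 32 bits.
        unsigned int next = (val < room) ? (assumed + val) : limit;
        if (next == assumed) break;  // nothing to change, skip the CAS
        // Store `next` only if no other thread modified *addr in the meantime.
        old = atomicCAS(addr, assumed, next);
    } while (old != assumed);
    return old;
}

// Every thread adds the same value to a single shared counter.
__global__ void accumulate(unsigned int* counter, unsigned int val, unsigned int limit)
{
    atomicAddClamped(counter, val, limit);
}

// Assumed demo setup: 2^20 threads each add 2^16, so the unclamped sum (2^36)
// would not fit in 32 bits.
unsigned int runDemo()
{
    unsigned int* dCounter = nullptr;
    unsigned int hCounter = 0;
    cudaMalloc(&dCounter, sizeof(unsigned int));
    cudaMemset(dCounter, 0, sizeof(unsigned int));
    accumulate<<<4096, 256>>>(dCounter, 1u << 16, 0xFFFFFFFFu);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&hCounter, dCounter, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(dCounter);
    return hCounter;
}

int main()
{
    printf("%u\n", runDemo());
    return 0;
}
```

This keeps the data at 32 bits, but the retry loop can be noticeably slower than a plain `atomicAdd` when many threads hit the same address, which is why a dedicated instruction would be preferable if one exists.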