Sure. I’ve also improved the code a little bit after looking at it again.
I started from the example code in appendix B.11 of the Programming Guide for a double-precision atomic add, which loops until [font=“Courier New”]atomicCAS()[/font] is able to set the desired value. After that it only needs a bit of byte juggling, for which I use the [font=“Courier New”]__byte_perm()[/font] intrinsic described in appendix C.2.3 of the Programming Guide.
[font=“Courier New”]address & ~3[/font] rounds the address downwards to the previous multiple of four to find the word where the char value is located. The cast to [font=“Courier New”]size_t[/font] enables us to use bit operations on the pointer.
[font=“Courier New”]0x3210[/font] is the selector that makes [font=“Courier New”]__byte_perm()[/font] just return its first argument. We substitute a 4 at the place where the new byte value is to be inserted. The place where the new byte should end up is determined by the lowest two bits of the original byte address.
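The address and selector arithmetic described so far can be sketched roughly as follows. The helper names [font=“Courier New”]wordAddress[/font] and [font=“Courier New”]insertSelector[/font] are made up for this illustration and are not taken from the original code; they are marked [font=“Courier New”]__host__ __device__[/font] only so that they can be tried out on the host.

```cuda
#include <cstddef>
#include <cstdio>

// Hypothetical helper: the 32-bit word that contains the byte at `address`.
__host__ __device__ inline unsigned int *wordAddress(char *address)
{
    // Clearing the two lowest address bits rounds down to a multiple of four.
    return (unsigned int *)((size_t)address & ~(size_t)3);
}

// Hypothetical helper: selector 0x3210 with a 4 substituted at nibble `byteIdx`,
// where byteIdx is the lowest two bits of the byte address (0..3).
__host__ __device__ inline unsigned int insertSelector(unsigned int byteIdx)
{
    unsigned int shift = 4 * byteIdx;
    return (0x3210u & ~(0xFu << shift)) | (0x4u << shift);
}

int main()
{
    alignas(4) char buf[8] = {};
    for (int i = 0; i < 8; ++i) {
        unsigned int byteIdx = (unsigned int)((size_t)&buf[i] & 3);
        int wordOffset = (int)((char *)wordAddress(&buf[i]) - buf);
        printf("byte %d: word starts at offset %d, selector 0x%04x\n",
               i, wordOffset, insertSelector(byteIdx));
    }
    return 0;
}
```

A four-entry lookup table ([font=“Courier New”]0x3214, 0x3240, 0x3410, 0x4210[/font]) indexed by the lowest two address bits would give the same selectors.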
For calculating [font=“Courier New”]min_[/font] (I’ve added an underscore because [font=“Courier New”]min[/font] is already taken by the intrinsic function of the same name) we use another [font=“Courier New”]__byte_perm()[/font] to extract the previous value of the char. When I edited the code, I set the higher nibbles of the selector to 4 so that the higher bytes of the resulting value are explicitly set from the second argument, i.e. zero, although in NVIDIA’s current implementation the cast to [font=“Courier New”]char[/font] is already sufficient to ensure that.
To compute the desired word [font=“Courier New”]new_[/font] (again with an underscore as [font=“Courier New”]new[/font] already is a C++ keyword) we juggle the [font=“Courier New”]min_[/font] value into place using the selector already obtained.
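To make the two [font=“Courier New”]__byte_perm()[/font] calls concrete, here is a small stand-alone sketch that extracts one byte from a word and then writes a new byte into the same position. The kernel and host function names, and the sample values, are invented for the illustration, and CUDA error checking is left out for brevity.

```cuda
#include <cstdio>

// Hypothetical demo kernel: extract byte `byteIdx` of `word`, then put the
// low byte of `value` into that position.
__global__ void bytePermDemo(unsigned int word, unsigned int byteIdx,
                             unsigned int value, unsigned int *out)
{
    // The low nibble picks the wanted byte of `word`; the 4s pick byte 0 of
    // the second argument (zero), so the upper three bytes come out as zero.
    out[0] = __byte_perm(word, 0, 0x4440 | byteIdx);

    // 0x3210 with a 4 at nibble byteIdx: keep `word`, but take byte 0 of
    // `value` at the chosen position.
    unsigned int shift = 4 * byteIdx;
    unsigned int sel = (0x3210u & ~(0xFu << shift)) | (0x4u << shift);
    out[1] = __byte_perm(word, value, sel);
}

// Runs the demo for word 0x44332211, byte index 2 and new byte 0x7F.
void runBytePermDemo(unsigned int out[2])
{
    unsigned int *d_out;
    cudaMalloc(&d_out, 2 * sizeof(unsigned int));
    bytePermDemo<<<1, 1>>>(0x44332211u, 2, 0x7Fu, d_out);
    cudaMemcpy(out, d_out, 2 * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
}

int main()
{
    unsigned int out[2];
    runBytePermDemo(out);
    // Byte 2 of 0x44332211 is 0x33; replacing it with 0x7F gives 0x447f2211.
    printf("extracted 0x%08x, merged 0x%08x\n", out[0], out[1]);
    return 0;
}
```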
Unlike in the [font=“Courier New”]atomicAdd()[/font] example, it is quite likely here that the desired new value is the same as the value already in place, so in the edited code I added an optimization that skips the atomic operation in that case. (One could also just branch out of the loop if the previous char value is already greater than or equal to the provided one, implementing the maximum function by hand. But as the intrinsic [font=“Courier New”]max()[/font] function compiles to a single instruction, I left the code as it is because I find it more descriptive.)
[font=“Courier New”]atomicCAS()[/font] will then write the new word to memory, provided no external change to the memory location has occurred in the meantime that might render our calculation invalid. If such an external change is detected, we start all over.
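Putting the steps together, the whole loop might look roughly like the sketch below. This is a reconstruction from the description, not the original code: the function name [font=“Courier New”]atomicMaxChar[/font], the variable names, and the test kernel are my own, and I assume the operation is a maximum as the [font=“Courier New”]max()[/font] remark suggests (swap in [font=“Courier New”]min()[/font] for a minimum). It also assumes that the whole 32-bit word around the char lies inside the allocation, and it omits CUDA error checking.

```cuda
#include <cstddef>
#include <cstdio>

// Hypothetical sketch: atomic maximum on a single char, returning the old value.
__device__ char atomicMaxChar(char *address, char val)
{
    // Word containing the byte, and the byte's position inside that word.
    unsigned int *base = (unsigned int *)((size_t)address & ~(size_t)3);
    unsigned int byteIdx = (unsigned int)((size_t)address & 3);

    // 0x3210 with a 4 at nibble byteIdx (inserts byte 0 of the second argument).
    unsigned int shift = 4 * byteIdx;
    unsigned int sel = (0x3210u & ~(0xFu << shift)) | (0x4u << shift);

    unsigned int old = *base, assumed;
    do {
        assumed = old;
        // Previous char value; the upper bytes are explicitly zeroed.
        char prev = (char)__byte_perm(old, 0, 0x4440 | byteIdx);
        unsigned int max_ = (unsigned int)max((int)val, (int)prev);
        unsigned int new_ = __byte_perm(old, max_, sel);
        if (new_ == old)
            break;  // value already in place, skip the atomic operation
        // Succeeds only if the word is still what the calculation was based on.
        old = atomicCAS(base, assumed, new_);
    } while (assumed != old);

    return (char)__byte_perm(old, 0, 0x4440 | byteIdx);
}

// Test kernel: thread i offers the value i % 100 to byte i & 3.
__global__ void maxKernel(char *bytes, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicMaxChar(&bytes[i & 3], (char)(i % 100));
}

void runMaxDemo(char result[4])
{
    char zeros[4] = {0, 0, 0, 0};
    char *d_bytes;
    cudaMalloc(&d_bytes, sizeof(zeros));
    cudaMemcpy(d_bytes, zeros, sizeof(zeros), cudaMemcpyHostToDevice);
    maxKernel<<<4, 256>>>(d_bytes, 1024);
    cudaMemcpy(result, d_bytes, sizeof(zeros), cudaMemcpyDeviceToHost);
    cudaFree(d_bytes);
}

int main()
{
    char result[4];
    runMaxDemo(result);
    // Each byte should end up holding the largest value offered to it.
    printf("%d %d %d %d\n", result[0], result[1], result[2], result[3]);
    return 0;
}
```

All four bytes share one word here, so the threads compete on the same [font=“Courier New”]atomicCAS()[/font] target, which exercises the retry path.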