I believe atomicMax(float) and atomicMin(float) are frequently needed operations in many application scenarios.
Although there are some type-casting implementations, I'm mainly concerned about their correctness and accuracy; speed is my secondary concern. These implementations have to rely on atomicCAS(), which only handles integer types.
What's the reason for the absence of atomicMax(float) in the runtime library API?
int / unsigned and float have an interesting property: some ordering relationships are preserved when casting between them. This may be of interest from a performance perspective. However, (proper) implementations based on atomicCAS do not sacrifice accuracy/correctness.
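To illustrate the property (a minimal sketch of my own, not from the linked post): for non-negative IEEE-754 floats the bit patterns compare the same way as signed integers, while for negative floats the integer ordering is reversed.

```
#include <cstdio>

__global__ void ordering_demo()
{
    // Non-negative floats: a larger float value has a larger bit pattern,
    // whether the bits are viewed as signed or unsigned integers.
    float a = 1.5f, b = 2.5f;
    printf("%d\n", __float_as_int(a) < __float_as_int(b));  // prints 1, matches a < b

    // Negative floats: the signed-integer ordering of the bit patterns
    // is the reverse of the float ordering.
    float c = -1.5f, d = -2.5f;  // c > d as floats
    printf("%d\n", __float_as_int(c) < __float_as_int(d));  // prints 1, opposite of c > d
}

int main()
{
    ordering_demo<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```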
Atomic operations require specific hardware support, so the usual reason for these decisions is a cost/benefit analysis. Since it's known that atomicMax / atomicMin can be performed at full accuracy and approximately full speed using a method such as the one I linked above, there would seem to be less benefit to justify the cost of providing additional hardware support for it. That's just my view/opinion of things.
It might be interesting to speculate why a suitable software implementation ("wrapper") is not provided in one of the CUDA header files. There may be some wrinkles with the method I linked that I'm not aware of.
On second thought:
for float, we can use Xiaojing An's implementation.
for double, we need to use jet47's atomicCAS implementation.
Is that correct? Thanks!
If atomicMin and atomicMax on float are really that common (I don’t know one way or the other; I have never needed them), the best long-term approach would be to file an RFE (enhancement request) with NVIDIA via the bug reporting mechanism. The development of the GPU computing ecosystem is largely driven by customer requests, so the more people ask, the more likely it is that some feature will materialize. Realistically, one should anticipate that requests connected to revenue generation will weigh more heavily.
An RFE might prompt NVIDIA to add support via software emulation as a first step. At first glance, addressing all possible correctness issues regarding atomicity and proper ordering seems like a pain in the behind; in particular, an implementation for double seems daunting. I would think that a hardware implementation for float wouldn't be overly expensive. There is already a floating-point adder in place for atomicAdd. With an adder one can also subtract, and when one can subtract, one can compare, and min/max is a comparison plus a MUX.
While RFEs don’t guarantee that desired features will materialize, it is pretty unlikely they will come into existence without an RFE.
The atomicCAS-in-a-loop method is mentioned in the programming guide as being useful for "any atomic operation", and in my experience that is a true statement for atomic operations at sizes up to 64 bits. The main drawback that I am aware of is related to performance, and this is especially noticeable when multiple threads are attempting to do atomics on the same location.
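For reference, here is a sketch of that method for a float max, modeled on the template in the programming guide (the function name is mine, not an official API):

```
// atomicCAS-in-a-loop sketch for a float max; atomicMaxFloatCAS is my name for it
__device__ float atomicMaxFloatCAS(float *address, float val)
{
    int *address_as_int = reinterpret_cast<int *>(address);
    int old = *address_as_int, assumed;
    do {
        assumed = old;
        // compute the desired result in the float domain, then swap it in
        // only if the location still holds the value we based it on
        old = atomicCAS(address_as_int, assumed,
                        __float_as_int(fmaxf(val, __int_as_float(assumed))));
    } while (assumed != old);
    return __int_as_float(old);
}
```

Because the comparison happens in the float domain via fmaxf, this version inherits the fmin/fmax NaN semantics discussed below, at the cost of possibly retrying the loop under contention.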
The float atomicMax operation based on ordering similarities to int / unsigned is roughly captured in the answer by Xiaojing An. However, as already indicated in that answer, making use of the ordering similarities is not possible in all cases via a simple cast from e.g. float to int; there are ranges that have to be mapped and handled. I think that answer is mostly correct, but according to my testing it does not give the proper result in all cases for atomicMax when the argument is float negative zero, and I have made a comment to that effect. Nevertheless, I would expect a careful implementation like that, one which handles all cases of interest correctly, to be preferable for performance reasons to the atomicCAS-in-a-loop method, as it only needs to do one atomic op to complete the work.
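As I recall, the gist of that answer is along these lines (reproduced from memory, so treat it as a sketch subject to the caveats in this thread):

```
// Sketch of the casting approach, as I recall it from the linked answer.
// Non-negative floats order like signed ints, so atomicMax(int) works;
// negative floats order in reverse, and as unsigned ints a more negative
// float has a larger bit pattern, so a float max becomes an unsigned min.
__device__ __forceinline__ float atomicMaxFloat(float *addr, float value)
{
    float old;
    old = (value >= 0.0f)
        ? __int_as_float(atomicMax(reinterpret_cast<int *>(addr), __float_as_int(value)))
        : __uint_as_float(atomicMin(reinterpret_cast<unsigned int *>(addr), __float_as_uint(value)));
    return old;
}
```

The negative zero problem is visible here: -0.0f satisfies value >= 0.0f, but __float_as_int(-0.0f) is INT_MIN, so the atomicMax never updates the location even when it holds a more negative float such as -1.0f.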
I’m not aware that anyone has worked out a similar casting-and-ordering strategy for double and none of my comments in this thread were intended to communicate that. I don’t know if it is possible or not, I have not tried it, and I have not seen indications of anyone trying it.
With respect to accuracy, I would only attempt to discuss approaches that I would consider to have “proper accuracy”. My comments above assume one is interested only in approaches where the results of the proposed atomicMax operation are considered correct from a mathematical interpretation of maximum. Therefore I believe it is possible with sufficient attention to ranges and special cases to achieve an accurate float atomicMax using the previously discussed methods. The atomicCAS-in-a-loop method, even though it is typically casting to another data type, is not advanced in the programming guide as something that may be somehow “inaccurate”. It is intended to be accurate.
If you look at the definitions of fmin() and fmax() in C++, they state that if one operand is NaN, the other is returned. I am pretty sure that is what the GPU hardware instruction FMNMX does. A type-casting approach via an integer type will not naturally achieve this behavior.
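A small demo of that difference (my own sketch, nothing official):

```
#include <cstdio>
#include <cmath>

__global__ void nan_demo()
{
    float nan_val = nanf("");
    // fmaxf returns the non-NaN operand when exactly one operand is NaN
    printf("fmaxf(NaN, 1.0f) = %f\n", fmaxf(nan_val, 1.0f));  // prints 1.000000
    // viewed as an integer, a NaN bit pattern (e.g. 0x7fc00000) is a large
    // positive value, so an integer-based max would let NaN "win" instead
    printf("NaN bits: 0x%08x\n", __float_as_int(nan_val));
}

int main()
{
    nan_demo<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```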
Given such intricacies, the most user-friendly approach would be for NVIDIA engineering to work out all the details and provide well-tested, ready-to-use implementations of atomicMin(float) and atomicMax(float) that work exactly analogously to the existing fmin and fmax and are as fast as possible. Hence my recommendation to file an RFE.
I spent some time fiddling with this and attempting to do some verification, so I’ll record some notes here. Nothing really earthshaking or new.
The logic from Xiaojing An’s answer seems to be approximately correct. There appears to be more than one way to skin the cat here, but I’ll start with that.
NaN values are not handled correctly. Therefore I would suggest the following: 1. If a NaN value is provided for the atomic update, just return without doing the atomic. 2. If a NaN value is already present in the atomic location, this method may not give the proper result; in that case, if proper handling of NaN is required, I would suggest using the atomicCAS-in-a-loop method.
Negative zero handling has some ordering issues. The serious ones (e.g. negative zero ordered below negative one) can be rectified by adding float zero to the float value passed for the atomic update. The remaining issue is that if negative zero is already present in the atomic location, its ordering with respect to positive zero may not be correct. This doesn't strike me as a serious issue, but if it is, again my suggestion would be to switch to the atomicCAS method.
To summarize, the method given by Xiaojing An, with the following precursors, seems like it may be useful (a combined sketch follows the list):
If the value passed is NaN, simply return, taking no action.
Add float zero to the value passed.
Use the functions defined by Xiaojing An to complete the atomic op.
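Put together, a sketch of those steps might look like this (the function name is mine, and it is illustrative rather than tested production code):

```
// combines the three precursor steps with the casting approach
__device__ float atomicMaxFloatGuarded(float *addr, float value)
{
    if (isnan(value))
        return *addr;       // step 1: ignore NaN updates (note: a plain, non-atomic read)
    value = value + 0.0f;   // step 2: canonicalizes -0.0f to +0.0f
    // step 3: the signed/unsigned casting trick
    return (value >= 0.0f)
        ? __int_as_float(atomicMax(reinterpret_cast<int *>(addr), __float_as_int(value)))
        : __uint_as_float(atomicMin(reinterpret_cast<unsigned int *>(addr), __float_as_uint(value)));
}
```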
I'm using the term "ordering" a bit loosely here. What I mean specifically is that an inequality or equality test between two numbers considered as integers or unsigned integers matches (or does not match) the corresponding inequality or equality test on the same numbers considered as float values. The distinction, AFAIK, is evident for example in the "ordering" of positive and negative zero. Positive and negative zero are not "ordered" in the traditional sense as float values, because by definition they are declared to be equal, but when considered as integers they are "ordered".
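Concretely (my own illustration):

```
#include <cstdio>

__global__ void zero_ordering_demo()
{
    float pz = 0.0f, nz = -0.0f;
    printf("%d\n", pz == nz);                                // 1: equal as floats
    printf("0x%08x 0x%08x\n",
           __float_as_int(pz), __float_as_int(nz));          // 0x00000000 0x80000000
    printf("%d\n", __float_as_int(pz) > __float_as_int(nz)); // 1: "ordered" as ints
}

int main()
{
    zero_ordering_demo<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```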
One takeaway for me is that a proper "wrapper" could not be created this way, and that is one possible explanation for why CUDA does not provide one using this method. The only proper "wrapper" (that I know of) would be one that wrapped the atomicCAS-in-a-loop method. A template/example for that method is already given in the programming guide (and is also present in the previously linked post).