CUDA Math API, difference between nearbyintf(), roundf() and rintf()?

My understanding from the documentation is that for ties (0.5, -0.5,…) rintf() rounds “halfway cases rounded towards zero” while roundf() rounds “with halfway cases rounded away from zero”, how does nearbyintf() handle ties?

I need the device functionality of roundf() as efficiently as possible, yet the documentation mentions roundf() is the slow option.

would the difference between rintf() and roundf() not always be 1?
towards zero and away from zero should always imply a difference of 1?

thus, the max cost penalty between the functions is:

if ((fabs(initial_value - round_result) == boundary)
round_result++; // or round_result–;

if this is correct, how expensive is that?

Both rintf() and nearbyintf() as defined in C/C++ apply one of the four IEEE-754 defined rounding modes (towards zero, towards positive infinity, towards negative infinity, to nearest or even) to its input, as specified by the current dynamic rounding mode. The two functions differ in their ability to raise exceptions.

The floating-point units in the GPU do not support a dynamic rounding mode, so the CUDA convention is that all C/C++ standard math functions operating with dynamic rounding mode use the C/C++ default setting of that rounding mode, which is round-to-nearest-or-even. Since the GPU also does not support floating-point exceptions it follows that rintf() and nearbyintf() are functionally identical.

The GPU has hardware support for the four IEEE-754 rounding modes. Therefore roundf() uses a short emulation sequence to achieve the desired fifth rounding mode, round-to-nearest-ties-away-from-zero. Emulation sequences are injected at the PTX level, the roundf() emulation appears to be translated into eight machine instructions by the CUDA 6.5 compiler, across multiple architectures.

If your data is range-restricted a more efficient emulation may be possible. Below is a sample implementation of roundf(). If the arguments to roundf() are known to be always positive, you can eliminate the calls to fabsf() and copysignf() in the code below. Likewise, you may be able to eliminate one or both sides of the range check if the inputs are naturally restricted to this or a narrower range.

__device__ float my_roundf (float a)
    float fa = fabsf (a);
    float t = ((fa >= 0.5f) && (fa <= 8388608.0f)) ? 0.5f : 0.0f;
    return copysignf (truncf (fa + t), a);

Note that the above implementation has one tiny functional difference to CUDA’s built-in roundf(). For single-precision arithmetic in CUDA, when a NaN result is returned, this is always the canonical NaN with bit pattern 0x7FFFFFFF. However, with the code above, if a NaN argument has bit 31 set, the function will return a pattern of 0xFFFFFFFF instead, due to the call to copysignf() at the end, which is a bit-level operation according to IEEE-754 (2008). Since all NaN patterns are treated the same, this makes no practical difference, but I thought I should point it out before someone performs an exhaustive test with bit-wise comparison of the result and finds this discrepancy.