CUDA Math API, difference between nearbyintf(), roundf() and rintf()?

My understanding from the documentation is that for ties (0.5, -0.5, …) rintf() rounds with "halfway cases rounded towards zero", while roundf() rounds with "halfway cases rounded away from zero". How does nearbyintf() handle ties?

I need the device functionality of roundf() as efficiently as possible, yet the documentation mentions roundf() is the slow option.

Wouldn't the difference between rintf() and roundf() always be 1? Rounding towards zero versus away from zero should always imply a difference of 1, shouldn't it?

Thus, the maximum cost penalty between the functions is:

if (fabsf(initial_value - round_result) == boundary)
{
    round_result++; // or round_result--;
}

If this is correct, how expensive is that?

Both rintf() and nearbyintf(), as defined in C/C++, apply one of the four IEEE-754 rounding modes (towards zero, towards positive infinity, towards negative infinity, to-nearest-or-even) to their input, as selected by the current dynamic rounding mode. The two functions differ only in their ability to raise floating-point exceptions.

The floating-point units in the GPU do not support a dynamic rounding mode, so the CUDA convention is that all C/C++ standard math functions operating with a dynamic rounding mode use the C/C++ default setting of that rounding mode, which is round-to-nearest-or-even. Since the GPU also does not support floating-point exceptions, it follows that rintf() and nearbyintf() are functionally identical.

The GPU has hardware support only for the four IEEE-754 rounding modes; round-to-nearest-ties-away-from-zero, the fifth mode required by roundf(), is not among them. Therefore roundf() uses a short emulation sequence to achieve that rounding mode. Emulation sequences are injected at the PTX level; the roundf() emulation appears to be translated into eight machine instructions by the CUDA 6.5 compiler, across multiple architectures.

If your data is range-restricted, a more efficient emulation may be possible. Below is a sample implementation of roundf(). If the arguments to roundf() are known to always be positive, you can eliminate the calls to fabsf() and copysignf() in the code below. Likewise, you may be able to eliminate one or both sides of the range check if the inputs are naturally restricted to this or a narrower range.

__device__ float my_roundf (float a)
{
    float fa = fabsf (a);
    /* add 0.5 only if |a| is in [0.5, 2^23]; above 2^23 every float is already an integer */
    float t = ((fa >= 0.5f) && (fa <= 8388608.0f)) ? 0.5f : 0.0f;
    return copysignf (truncf (fa + t), a);
}

Note that the above implementation has one tiny functional difference from CUDA's built-in roundf(). For single-precision arithmetic in CUDA, a NaN result is always the canonical NaN with bit pattern 0x7FFFFFFF. With the code above, however, a NaN argument with bit 31 set produces the pattern 0xFFFFFFFF instead, because the final copysignf() is a bit-level operation according to IEEE-754 (2008). Since all NaN patterns are treated the same, this makes no practical difference, but I thought I should point it out before someone performs an exhaustive test with bit-wise comparison of the results and finds this discrepancy.