Weekend project: Fast arctangent computation with Borchardt's algorithm

njuffa · August 16, 2024, 2:28am

This past week I learned about Borchardt’s algorithm (BA) ^[*]. Given positive m, n with m < n, the iteration

m_i+1 = (m_i + n_i) / 2
n_i+1 = √(m_i+1 · n_i)

converges to √(n²-m²) / arccos (m / n). With m = 1, n= √(x²+1), BA (1, √(x²+1)) converges to x / arctan(x). One can therefore use arctan(x) = x / BA (1, √(x²+1)).

The convergence of Borchardt’s algorithm is slow, so unless a method of convergence acceleration can be applied, it is of little practical utility. Generally speaking, convergence acceleration methods involve taking n consecutive values from a sequence of converging values and computing a significantly more accurate value from them.

In 1961, someone on the team responsible for the early Dutch computer Electrologica X1, possibly the British mathematician Peter Wynn, devised a simple acceleration scheme useful for arctangent computation via BA. Without prior familiarity with convergence acceleration, I have not been able to identify the method used for the X1 in the literature. Instead I simply performed a brute force search for the weights assigned to sequence values.

The exemplary single-precision implementation below uses just a single iteration and therefore combines just two sequence values to compute the final result. Relative error of this approximate arctangent computation is < 3.64e-5. Without the convergence acceleration, i.e. when simply using the last sequence value b₁, the maximum relative error would be 9.97e-2. The code below is a full-range implementation that compiles to 14 instructions. For a “fast & furious” implementation that eliminates the handling of arguments with magnitude >= 2⁶⁴ one line of source code can be eliminated, resulting in a sequence of 11 instructions.

NOTE: This code relies on the availability of the instruction MUFU.SQRT, which was added with the second-generation Maxwell architecture (sm_52) in 2014. On earlier architectures this functionality would have to be emulated via MUFU.RCP(MUFU.RSQ(x)), resulting in some reduction in performance and accuracy. As is pointed out in Forman S. Acton, Numerical methods that work, MAA 1990, p. 9, with simple scaling one can also compute atan(x) = x · rsqrt (x²+1) · 1/ BA (rsqrt (x²+1), 1)). I am currently working on re-arranging the arithmetic for the case of a single iteration of Borchardt’s algorithm with convergence acceleration, so that only reciprocal square root is used.

[code below updated 8/16/2024, 8/17/2024, 8/18/2024, 10/19/2024, 11/5/2024, 2/25/2025]

/*
  Copyright (c) 2024, Norbert Juffa

  Redistribution and use in source and binary forms, with or without 
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright 
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

/* Use MUFU.RCP directly */
__forceinline__ __device__ float gpu_rcp_approx (float divisor)
{
    float r;
    asm ("rcp.approx.ftz.f32 %0,%1;\n\t" : "=f"(r) : "f"(divisor));
    return r;
}

/* Use MUFU.SQRT directly */
__forceinline__ __device__ float gpu_sqrt_approx (float radicand)
{
    float r;
    asm ("sqrt.approx.ftz.f32 %0,%1;\n\t" : "=f"(r) : "f"(radicand));
    return r;
}

/* maximum error < 428.3 ulps, maximum relative error < 3.64e-5 */
__forceinline__ __device__ float fast_atanf (float x)
{
    const float pio2 = 1.57079633f;
    float b0, b1, r, t;

    t = fmaf (x, x, 1.0f);
    b0 = gpu_sqrt_approx (t);
    t = 0.5f * t;
    t = fmaf (0.5f, b0, t);
    b1 = gpu_sqrt_approx (t);
    r = fmaf (b0, 0.144315556f, 0.159451306f); // 0x1.278eeap-3, 0x1.468e68p-3
    r = fmaf (b1, 0.696223438f, r); // 0x1.647766p-1
    r = gpu_rcp_approx (r);
    r = x * r;
    if (fabsf (x) >= 0x1.0p64f) r = copysignf (pio2, x); // may be optional
    return r;
}

^[*] C.W. Borchardt, “Sur deux algorithmes analogues à celui de la moyenne arithmético-géométrique de deux éléments.” In L. Cremona and E. Beltrami (eds.), In Memoriam Dominici Chelini , Milan 1881, pp. 206-212.

Unbeknownst to Borchardt, the same algorithm had previously been discussed in correspondence between J.F. Pfaff and C.F. Gauss in1800, but this did not appear in print until several decades later when Gauss’s collected works were published: Gauss, Werke, Vol. 10, 1917, p. 234.

The iterative formula itself also appeared in part 1 of Jacques Schwab, Élémens de géométrie. Nancy: C.-J. Hissette 1813, p. 104. Schwab used it to compute the radii of the inscribed and inscribed circles of a polygon as the number of sides n doubled.

Let a and b be the radii of the inscribed and circumscribed circles of a regular n-gon with perimeter P. Then the corresponding quantities a’ and b’ of a regular 2-n-gon with the same perimeter P are:

a’ = (a + b) / 2
b’ = √ (a’ * b)

While Schwab provided a geometric proof, one can also demonstrate this based on the following. Given a regular n-gon with perimeter P, and each side S = P / n, the radius of the inscribed circle is

a = r_inscribed = P / (2 n tan (π / n)) = S / (2 tan (π / n))

while the radius of the circumscribed circle is

b = r_{circumscribed} = P / (2 n sin (π / n)) = S / (2 sin (π / n))

njuffa · September 2, 2024, 8:11am

To satisfy my curiosity I tried adding one more iteration to the Borchardt algorithm and thus one more term to the weighted sum of the convergence acceleration scheme. While this improves accuracy almost to full single precision, it is not really attractive from a performance perspective, as CUDA’s built-in atanf() is quite fast.

[ code below updated 10/19/2024, 11/12/2024 ]

/*
  Copyright (c) 2024, Norbert Juffa

  Redistribution and use in source and binary forms, with or without 
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright 
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

/* maximum error < 5.99 ulps, maximum relative error < 5.12e-7 */
__forceinline__ __device__ float fast_atanf (float x)
{
    const float pio2 = 1.57079633f;
    float a1, b0, b1, b2, r, t;

    t = fmaf (x, x, 1.0f);
    b0 = gpu_sqrt_approx (t);
    t = 0.5f * t;
    t  = fmaf (0.5f, b0, t);
    a1 = fmaf (0.5f, b0, 0.5f);
    b1 = gpu_sqrt_approx (t);
    t = fmaf (a1, b1, t) * 0.5f;
    b2 = gpu_sqrt_approx (t);
    r = fmaf (b0, 0.076905564f, 0.076589905f); // 0x1.3b0154p-4, 0x1.39b656p-4
    r = fmaf (b1, 0.124636024f, r); // 0x1.fe8258p-4
    r = fmaf (b2, 0.721868575f, r); // 0x1.7198c2p-1
    r = gpu_rcp_approx (r);
    r = x * r;
    if (fabsf (x) >= 0x1.0p64f) r = copysignf (pio2, x); // may be optional
    return r;
}

njuffa · October 19, 2024, 11:50pm

I figured another variant worth trying is to use just the starting approximation for Borchardt’s algorithm and see whether that could yield a useful approximation. It turns out that one can achieve just over seven correct bits in the result using this approach. I have no ideas where this low-accuracy approximation might come in handy, but maybe someone else knows of such an application.

/*
  Copyright (c) 2024, Norbert Juffa

  Redistribution and use in source and binary forms, with or without 
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright 
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

/* Use MUFU.RCP directly */
__forceinline__ __device__ float gpu_rcp_approx (float divisor)
{
    float r;
    asm ("rcp.approx.ftz.f32 %0,%1;\n\t" : "=f"(r) : "f"(divisor));
    return r;
}

/* Use MUFU.SQRT directly */
__forceinline__ __device__ float gpu_sqrt_approx (float radicand)
{
    float r;
    asm ("sqrt.approx.ftz.f32 %0,%1;\n\t" : "=f"(r) : "f"(radicand));
    return r;
}

/* maximum error < 97748 ulps, maximum relative error < 6.33e-3 */
__forceinline__ __device__ float fast_atanf (float x)
{
    const float pio2 = 1.57079633f;
    float b0, r, t;

    t = fmaf (x, x, 1.0f);
    b0 = gpu_sqrt_approx (t);
    r = fmaf (b0, 0.640674412f, 0.362501860f); // 0x1.48067ap-1, 0x1.7333b0p-2
    r = gpu_rcp_approx (r);
    r = x * r;
    if (fabsf (x) >= 0x1.0p64f) r = copysignf (pio2, x); // may be optional
    return r;
}

njuffa · February 23, 2025, 12:31pm

I finally got around to rearranging the computation for the single-iteration Borchardt’s algorithm with convergence acceleration so that it relies on MUFU.RSQ instead of MUFU.SQRT. In addition to making this variant suitable for all GPUs, it is also very slightly more accurate than my previously posted implementation, and it may save an instruction depending on architecture (for example, on sm_89, without the clamping for arguments very large in magnitude, it compiles to just ten instructions).

/*
  Copyright (c) 2024-2025, Norbert Juffa

  Redistribution and use in source and binary forms, with or without 
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright 
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

/* Use MUFU.RCP directly */
__forceinline__ __device__ float gpu_rcp_approx (float divisor)
{
    float r;
    asm ("rcp.approx.ftz.f32 %0,%1;\n\t" : "=f"(r) : "f"(divisor));
    return r;
}

/* Use MUFU.RSQ directly */
__forceinline__ __device__ float gpu_rsqrt_approx (float divisor)
{
    float r;
    asm ("rsqrt.approx.ftz.f32 %0,%1;\n\t" : "=f"(r) : "f"(divisor));
    return r;
}

/* maximum error < 424.5 ulps, maximum relative error < 3.60e-5 */
__forceinline__ __device__ float fast_atanf (float x)
{
    const float pio2 = 1.57079633f;
    float a0, r, t;
    t = fma (x, x, 1.0f);
    a0 = gpu_rsqrt_approx (t);
    t = a0 + 1.0f;
    r = gpu_rsqrt_approx (t); 
    t = 0.492269814f * t;           // 0x1.f81594p-2
    t = fmaf (t, r, 0.144350037f);  // 0x1.27a0fep-3
    t = fmaf (a0, 0.159466118f, t); // 0x1.46962cp-3
    t = gpu_rcp_approx (t);
    r = a0 * x;
    r = r * t;
    if (fabsf (x) >= 0x1.0p64f) r = copysignf (pio2, x); // may be optional
    return r;
}

Without the clamping for very large arguments, the above compiles to just ten instructions for an sm_89 target:

 FFMA R0, R4, R4, 1 
 MUFU.RSQ R3, R0 
 FADD R5, R3, 1 
 MUFU.RSQ R6, R5 
 FMUL R7, R5, 0.49226981401443481445 
 FFMA R6, R6, R7, 0.14435003697872161865 
 FFMA R6, R3.reuse, 0.15946611762046813965, R6 
 FMUL R3, R3, R4 
 MUFU.RCP R6, R6 
 FMUL R4, R6, R3

Topic		Replies	Views
More accurate and somewhat faster implementation of atanf() CUDA Programming and Performance	0	915	May 15, 2022
Accuracy-optimized implementation of tanf(), without performance impact CUDA Programming and Performance	1	547	July 5, 2022
A faster and more accurate implementation of sincosf() CUDA Programming and Performance	25	9722	August 6, 2017
Unacceptably low accuracy of calculations on the GPU CUDA Programming and Performance cuda	12	2308	May 15, 2024
A more accurate and faster implementation of powf() CUDA Programming and Performance	6	4518	February 13, 2025
Performance tweak for single-precision square root CUDA Programming and Performance	0	1014	March 25, 2021
Accuracy in GPU floating point calculations CUDA Programming and Performance	35	8493	September 9, 2011
Optimized version of single-precision error function, erff() CUDA Programming and Performance	21	4726	December 25, 2017
A more accurate, performance-competitive implementation of expf() CUDA Programming and Performance	41	9276	July 24, 2025
Calling all Juffas! What's up with erfcf() nowadays? CUDA Programming and Performance	5	945	August 28, 2023

Weekend project: Fast arctangent computation with Borchardt's algorithm

Related topics