Weekend project: Fast arctangent computation with Borchardt's algorithm

This past week I learned about Borchardt’s algorithm (BA) [*]. Given positive m, n with m < n, the iteration

mi+1 = (mi + ni) / 2
ni+1 = √(mi+1 · ni)

converges to √(n2-m2) / arccos (m / n). With m = 1, n= √(x2+1), BA (1, √(x2+1)) converges to x / arctan(x). One can therefore use arctan(x) = x / BA (1, √(x2+1)).

The convergence of Borchardt’s algorithm is slow, so unless a method of convergence acceleration can be applied, it is of little practical utility. Generally speaking, convergence acceleration methods involve taking n consecutive values from a sequence of converging values and computing a significantly more accurate value from them.

In 1961, someone on the team responsible for the early Dutch computer Electrologica X1, possibly the British mathematician Peter Wynn, devised a simple acceleration scheme useful for arctangent computation via BA. Without prior familiarity with convergence acceleration, I have not been able to identify the method used for the X1 in the literature. Instead I simply performed a brute force search for the weights assigned to sequence values.

The exemplary single-precision implementation below uses just a single iteration and therefore combines just two sequence values to compute the final result. Relative error of this approximate arctangent computation is < 3.64e-5. Without the convergence acceleration, i.e. when simply using the last sequence value b1, the maximum relative error would be 9.97e-2. The code below is a full-range implementation that compiles to 14 instructions. For a “fast & furious” implementation that eliminates the handling of arguments with magnitude >= 264 one line of source code can be eliminated, resulting in a sequence of 11 instructions.

NOTE: This code relies on the availability of the instruction MUFU.SQRT, which was added with the second-generation Maxwell architecture (sm_52) in 2014. On earlier architectures this functionality would have to be emulated via MUFU.RCP(MUFU.RSQ(x)), resulting in some reduction in performance and accuracy. As is pointed out in Forman S. Acton, Numerical methods that work, MAA 1990, p. 9, with simple scaling one can also compute atan(x) = x · rsqrt (x2+1) · 1/ BA (rsqrt (x2+1), 1)). I am currently working on re-arranging the arithmetic for the case of a single iteration of Borchardt’s algorithm with convergence acceleration, so that only reciprocal square root is used.

[code below updated 8/16/2024, 8/17/2024, 8/18/2024, 10/19/2024, 11/5/2024, 2/25/2025]

/*
  Copyright (c) 2024, Norbert Juffa

  Redistribution and use in source and binary forms, with or without 
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright 
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

/* Use MUFU.RCP directly */
__forceinline__ __device__ float gpu_rcp_approx (float divisor)
{
    float r;
    asm ("rcp.approx.ftz.f32 %0,%1;\n\t" : "=f"(r) : "f"(divisor));
    return r;
}

/* Use MUFU.SQRT directly */
__forceinline__ __device__ float gpu_sqrt_approx (float radicand)
{
    float r;
    asm ("sqrt.approx.ftz.f32 %0,%1;\n\t" : "=f"(r) : "f"(radicand));
    return r;
}

/* maximum error < 428.3 ulps, maximum relative error < 3.64e-5 */
__forceinline__ __device__ float fast_atanf (float x)
{
    const float pio2 = 1.57079633f;
    float b0, b1, r, t;

    t = fmaf (x, x, 1.0f);
    b0 = gpu_sqrt_approx (t);
    t = 0.5f * t;
    t = fmaf (0.5f, b0, t);
    b1 = gpu_sqrt_approx (t);
    r = fmaf (b0, 0.144315556f, 0.159451306f); // 0x1.278eeap-3, 0x1.468e68p-3
    r = fmaf (b1, 0.696223438f, r); // 0x1.647766p-1
    r = gpu_rcp_approx (r);
    r = x * r;
    if (fabsf (x) >= 0x1.0p64f) r = copysignf (pio2, x); // may be optional
    return r;
}

[*] C.W. Borchardt, “Sur deux algorithmes analogues à celui de la moyenne arithmético-géométrique de deux éléments.” In L. Cremona and E. Beltrami (eds.), In Memoriam Dominici Chelini , Milan 1881, pp. 206-212.

Unbeknownst to Borchardt, the same algorithm had previously been discussed in correspondence between J.F. Pfaff and C.F. Gauss in1800, but this did not appear in print until several decades later when Gauss’s collected works were published: Gauss, Werke, Vol. 10, 1917, p. 234.

The iterative formula itself also appeared in part 1 of Jacques Schwab, Élémens de géométrie. Nancy: C.-J. Hissette 1813, p. 104. Schwab used it to compute the radii of the inscribed and inscribed circles of a polygon as the number of sides n doubled.

Let a and b be the radii of the inscribed and circumscribed circles of a regular n-gon with perimeter P. Then the corresponding quantities a’ and b’ of a regular 2-n-gon with the same perimeter P are:

a’ = (a + b) / 2
b’ = √ (a’ * b)

While Schwab provided a geometric proof, one can also demonstrate this based on the following. Given a regular n-gon with perimeter P, and each side S = P / n, the radius of the inscribed circle is

a = rinscribed = P / (2 n tan (π / n)) = S / (2 tan (π / n))

while the radius of the circumscribed circle is

b = rcircumscribed = P / (2 n sin (π / n)) = S / (2 sin (π / n))

5 Likes

To satisfy my curiosity I tried adding one more iteration to the Borchardt algorithm and thus one more term to the weighted sum of the convergence acceleration scheme. While this improves accuracy almost to full single precision, it is not really attractive from a performance perspective, as CUDA’s built-in atanf() is quite fast.

[ code below updated 10/19/2024, 11/12/2024 ]

/*
  Copyright (c) 2024, Norbert Juffa

  Redistribution and use in source and binary forms, with or without 
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright 
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

/* maximum error < 5.99 ulps, maximum relative error < 5.12e-7 */
__forceinline__ __device__ float fast_atanf (float x)
{
    const float pio2 = 1.57079633f;
    float a1, b0, b1, b2, r, t;

    t = fmaf (x, x, 1.0f);
    b0 = gpu_sqrt_approx (t);
    t = 0.5f * t;
    t  = fmaf (0.5f, b0, t);
    a1 = fmaf (0.5f, b0, 0.5f);
    b1 = gpu_sqrt_approx (t);
    t = fmaf (a1, b1, t) * 0.5f;
    b2 = gpu_sqrt_approx (t);
    r = fmaf (b0, 0.076905564f, 0.076589905f); // 0x1.3b0154p-4, 0x1.39b656p-4
    r = fmaf (b1, 0.124636024f, r); // 0x1.fe8258p-4
    r = fmaf (b2, 0.721868575f, r); // 0x1.7198c2p-1
    r = gpu_rcp_approx (r);
    r = x * r;
    if (fabsf (x) >= 0x1.0p64f) r = copysignf (pio2, x); // may be optional
    return r;
}
1 Like

I figured another variant worth trying is to use just the starting approximation for Borchardt’s algorithm and see whether that could yield a useful approximation. It turns out that one can achieve just over seven correct bits in the result using this approach. I have no ideas where this low-accuracy approximation might come in handy, but maybe someone else knows of such an application.

/*
  Copyright (c) 2024, Norbert Juffa

  Redistribution and use in source and binary forms, with or without 
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright 
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

/* Use MUFU.RCP directly */
__forceinline__ __device__ float gpu_rcp_approx (float divisor)
{
    float r;
    asm ("rcp.approx.ftz.f32 %0,%1;\n\t" : "=f"(r) : "f"(divisor));
    return r;
}

/* Use MUFU.SQRT directly */
__forceinline__ __device__ float gpu_sqrt_approx (float radicand)
{
    float r;
    asm ("sqrt.approx.ftz.f32 %0,%1;\n\t" : "=f"(r) : "f"(radicand));
    return r;
}

/* maximum error < 97748 ulps, maximum relative error < 6.33e-3 */
__forceinline__ __device__ float fast_atanf (float x)
{
    const float pio2 = 1.57079633f;
    float b0, r, t;

    t = fmaf (x, x, 1.0f);
    b0 = gpu_sqrt_approx (t);
    r = fmaf (b0, 0.640674412f, 0.362501860f); // 0x1.48067ap-1, 0x1.7333b0p-2
    r = gpu_rcp_approx (r);
    r = x * r;
    if (fabsf (x) >= 0x1.0p64f) r = copysignf (pio2, x); // may be optional
    return r;
}

I finally got around to rearranging the computation for the single-iteration Borchardt’s algorithm with convergence acceleration so that it relies on MUFU.RSQ instead of MUFU.SQRT. In addition to making this variant suitable for all GPUs, it is also very slightly more accurate than my previously posted implementation, and it may save an instruction depending on architecture (for example, on sm_89, without the clamping for arguments very large in magnitude, it compiles to just ten instructions).

/*
  Copyright (c) 2024-2025, Norbert Juffa

  Redistribution and use in source and binary forms, with or without 
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright 
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

/* Use MUFU.RCP directly */
__forceinline__ __device__ float gpu_rcp_approx (float divisor)
{
    float r;
    asm ("rcp.approx.ftz.f32 %0,%1;\n\t" : "=f"(r) : "f"(divisor));
    return r;
}

/* Use MUFU.RSQ directly */
__forceinline__ __device__ float gpu_rsqrt_approx (float divisor)
{
    float r;
    asm ("rsqrt.approx.ftz.f32 %0,%1;\n\t" : "=f"(r) : "f"(divisor));
    return r;
}

/* maximum error < 424.5 ulps, maximum relative error < 3.60e-5 */
__forceinline__ __device__ float fast_atanf (float x)
{
    const float pio2 = 1.57079633f;
    float a0, r, t;
    t = fma (x, x, 1.0f);
    a0 = gpu_rsqrt_approx (t);
    t = a0 + 1.0f;
    r = gpu_rsqrt_approx (t); 
    t = 0.492269814f * t;           // 0x1.f81594p-2
    t = fmaf (t, r, 0.144350037f);  // 0x1.27a0fep-3
    t = fmaf (a0, 0.159466118f, t); // 0x1.46962cp-3
    t = gpu_rcp_approx (t);
    r = a0 * x;
    r = r * t;
    if (fabsf (x) >= 0x1.0p64f) r = copysignf (pio2, x); // may be optional
    return r;
}

Without the clamping for very large arguments, the above compiles to just ten instructions for an sm_89 target:

 FFMA R0, R4, R4, 1 
 MUFU.RSQ R3, R0 
 FADD R5, R3, 1 
 MUFU.RSQ R6, R5 
 FMUL R7, R5, 0.49226981401443481445 
 FFMA R6, R6, R7, 0.14435003697872161865 
 FFMA R6, R3.reuse, 0.15946611762046813965, R6 
 FMUL R3, R3, R4 
 MUFU.RCP R6, R6 
 FMUL R4, R6, R3 
1 Like