This past week I learned about Borchardt’s algorithm (BA) [*]. Given positive m, n with m < n, the iteration
mi+1 = (mi + ni) / 2
ni+1 = √(mi+1 · ni)
converges to √(n2-m2) / arccos (m / n). With m = 1, n= √(x2+1), BA (1, √(x2+1)) converges to x / arctan(x). One can therefore use arctan(x) = x / BA (1, √(x2+1)).
The convergence of Borchardt’s algorithm is slow, so unless a method of convergence acceleration can be applied, it is of little practical utility. Generally speaking, convergence acceleration methods involve taking n consecutive values from a sequence of converging values and computing a significantly more accurate value from them.
In 1961, someone on the team responsible for the early Dutch computer Electrologica X1, possibly the British mathematician Peter Wynn, devised a simple acceleration scheme useful for arctangent computation via BA. Without prior familiarity with convergence acceleration, I have not been able to identify the method used for the X1 in the literature. Instead I simply performed a brute force search for the weights assigned to sequence values.
The exemplary single-precision implementation below uses just a single iteration and therefore combines just two sequence values to compute the final result. Relative error of this approximate arctangent computation is < 3.64e-5. Without the convergence acceleration, i.e. when simply using the last sequence value b1, the maximum relative error would be 9.97e-2. The code below is a full-range implementation that compiles to 14 instructions. For a “fast & furious” implementation that eliminates the handling of arguments with magnitude >= 264 one line of source code can be eliminated, resulting in a sequence of 11 instructions.
NOTE: This code relies on the availability of the instruction MUFU.SQRT
, which was added with the second-generation Maxwell architecture (sm_52
) in 2014. On earlier architectures this functionality would have to be emulated via MUFU.RCP(MUFU.RSQ(x))
, resulting in some reduction in performance and accuracy. As is pointed out in Forman S. Acton, Numerical methods that work, MAA 1990, p. 9, with simple scaling one can also compute atan(x) = x · rsqrt (x2+1) · 1/ BA (rsqrt (x2+1), 1)). I am currently working on re-arranging the arithmetic for the case of a single iteration of Borchardt’s algorithm with convergence acceleration, so that only reciprocal square root is used.
[code below updated 8/16/2024, 8/17/2024, 8/18/2024, 10/19/2024, 11/5/2024, 2/25/2025]
/*
Copyright (c) 2024, Norbert Juffa
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
/* Use MUFU.RCP directly */
__forceinline__ __device__ float gpu_rcp_approx (float divisor)
{
float r;
asm ("rcp.approx.ftz.f32 %0,%1;\n\t" : "=f"(r) : "f"(divisor));
return r;
}
/* Use MUFU.SQRT directly */
__forceinline__ __device__ float gpu_sqrt_approx (float radicand)
{
float r;
asm ("sqrt.approx.ftz.f32 %0,%1;\n\t" : "=f"(r) : "f"(radicand));
return r;
}
/* maximum error < 428.3 ulps, maximum relative error < 3.64e-5 */
__forceinline__ __device__ float fast_atanf (float x)
{
const float pio2 = 1.57079633f;
float b0, b1, r, t;
t = fmaf (x, x, 1.0f);
b0 = gpu_sqrt_approx (t);
t = 0.5f * t;
t = fmaf (0.5f, b0, t);
b1 = gpu_sqrt_approx (t);
r = fmaf (b0, 0.144315556f, 0.159451306f); // 0x1.278eeap-3, 0x1.468e68p-3
r = fmaf (b1, 0.696223438f, r); // 0x1.647766p-1
r = gpu_rcp_approx (r);
r = x * r;
if (fabsf (x) >= 0x1.0p64f) r = copysignf (pio2, x); // may be optional
return r;
}
[*] C.W. Borchardt, “Sur deux algorithmes analogues à celui de la moyenne arithmético-géométrique de deux éléments.” In L. Cremona and E. Beltrami (eds.), In Memoriam Dominici Chelini , Milan 1881, pp. 206-212.
Unbeknownst to Borchardt, the same algorithm had previously been discussed in correspondence between J.F. Pfaff and C.F. Gauss in1800, but this did not appear in print until several decades later when Gauss’s collected works were published: Gauss, Werke, Vol. 10, 1917, p. 234.
The iterative formula itself also appeared in part 1 of Jacques Schwab, Élémens de géométrie. Nancy: C.-J. Hissette 1813, p. 104. Schwab used it to compute the radii of the inscribed and inscribed circles of a polygon as the number of sides n doubled.
Let a and b be the radii of the inscribed and circumscribed circles of a regular n-gon with perimeter P. Then the corresponding quantities a’ and b’ of a regular 2-n-gon with the same perimeter P are:
a’ = (a + b) / 2
b’ = √ (a’ * b)
While Schwab provided a geometric proof, one can also demonstrate this based on the following. Given a regular n-gon with perimeter P, and each side S = P / n, the radius of the inscribed circle is
a = rinscribed = P / (2 n tan (π / n)) = S / (2 tan (π / n))
while the radius of the circumscribed circle is
b = rcircumscribed = P / (2 n sin (π / n)) = S / (2 sin (π / n))