Calling all Juffas! What's up with erfcf() nowadays?

dscerutti · August 12, 2023, 9:24pm

Hello NJuffa, this message really is for you if you still frequent this forum! Otherwise, I think it touches on enough topics of interest that other guests can benefit.

I’m revisiting some of the work I did back in 2017, making that spline table to estimate d/dr [ erfc(alpha * r) / r ], now with a first class C++ object, better unit testing, and all sorts of options. In just a few days, I’ve re-engineered the approach with a strategy that is more general, more accurate, and probably even publishable.

For comparison, I need to see what computing erfcf(alpha * r) / r and its derivatives would look like analytically, how it would stack up to this method. In my original spline approach, I was also exceeding the accuracy of the analytic computation, even after applying your fasterfc(float x) function which I think was designed to overcome numerical instabilities in erfcf() over the range of interest, based of course on your observations that log(erfc()) is very tractable to a low-order polynomial in the region where we want to work for molecular dynamics. (That fact is, to a degree, also the basis of a new idea I had a week or two ago, which led me back to this erfc() business in the first place.)

So, for the purposes of comparing 32-bit floating point calculations, I think I should start with your fasterfc() polynomial function. But, I also recall you saying more recently, “the CUDA compiler of 2023 is not the CUDA compiler of 2012.” Indeed. In 2014 or so they rolled a form of Kahan summation into the float => llint conversion. which obviated the need for a special function to accelerate that in pmemd.cuda, which itself had been written by another NVIDIA researcher. Have the math functions changed to any degree?

Also, do I have your permission to implement your fasterfc() in my new STORMM code base? I will attribute, and the code base is planned for open-source release. I would also like to have the code to fasterfc() in a supplement to a paper I’m writing up, which will likely present this new approach to splining functions using a log-indexed table among other material.

Cheers!

njuffa · August 13, 2023, 7:14am

The latest version of an erfcf() implementation for CUDA that I posted in these forums about a year ago is here:

I have only the vaguest recollections of “fastererfc”. In terms of incorporating it into other code, the most straightforward approach would be for me to re-post it with a 2-clause BSD-license attached. That way you can cut & paste it into any code base without issues. The practical issue that prevents me from doing that immediately is that I have literally thousands of files sitting around on my PC without version control.

If you could briefly remind me of the specifications of “fastererfc” I should be able to find and review the relevant piece of code or worst case reconstruct it in short order:

(1) What was the input domain (I seem to recall [0,4] was common in your field)?
(2) What kind of error bound was specified (absolute or relative)?
(3) What was the numeric value of the error bound?

I retired from NVIDIA in 2014 and cannot give you an overview of changes in the CUDA math library since then. I am aware that for some of the math functions NVIDIA has taken advantage of code I posted in these forums (whether verbatim or in modified form I do not know) because the copyright notice from the accompanying 2-clause BSD license appears in NVIDIA’s documentation. I am also aware of improvements to single-precision division and expf() as part of ongoing maintenance. I discover such differences coincidentally when looking at SASS for my own work, but I am not systematically hunting for such changes or tracking them.

CUDA’s single precision math functions benefit in minor fashion from changes to the GPU architectures over the years. For example, recent GPUs allow FP32 operations to incorporate a full FP32 literal constant, whereas older GPUs supported only truncated FP32 constants. Also, the introduction of instructions like LOP3 and IADD3 provide minor performance improvements.

njuffa · August 13, 2023, 8:17am

Below is a version of a fast erfc() computation that I found among my files, applied some minor cleanup to, and established error bounds for. Unfortunately the file contained just the code, with no specifications, annotation, or rationale. The large relative error and ulp errors occur near 4.0, where function values are small, e.g. erfc(4) = 1.541725790028×10^-8.

Note that some of the most recent GPU architectures have reduced MUFU throughput to 1/8 of FP32 throughput which can create new bottlenecks in code that heavily relies on MUFU operations. The code below uses 10 FP32 operations plus 1 MUFU operation, so still presents a balanced load.

Let me know if the code below is sufficient for your purposes or whether you need something different.

/*
  Copyright (c) 2023, Norbert Juffa

  Redistribution and use in source and binary forms, with or without 
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright 
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  "AS IS" AND ANY ERFCRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

__forceinline__ __device__ float raw_ex2 (float a)
{
    float r;
    asm ("ex2.approx.ftz.f32 %0,%1;" : "=f"(r) : "f"(a));
    return r;
}

/* Approximate erfc(x) on [0, 4] with maximum absolute error of 1.15291e-7,
   maximum relative error of 3.36619e-4, and maximum ulp error of 5621.588.
*/
__forceinline__ __device__ float fasterfc (float a) 
{
    float t;
    t =             -1.64611265e-6f;   // -0x1.b9e000p-20
    t = fmaf (t, a,  2.95254722e-5f);  //  0x1.ef5af0p-16
    t = fmaf (t, a, -2.33422339e-4f);  // -0x1.e985aap-13
    t = fmaf (t, a,  1.04246172e-3f);  //  0x1.11466cp-10
    t = fmaf (t, a, -2.55015842e-3f);  // -0x1.4e411ep-9
    t = fmaf (t, a,  3.19798535e-4f);  //  0x1.4f5544p-12
    t = fmaf (t, a,  2.76054665e-2f);  //  0x1.c449b8p-6
    t = fmaf (t, a, -1.48274124e-1f);  // -0x1.2faa58p-3
    t = fmaf (t, a, -9.18447673e-1f);  // -0x1.d63ec6p-1
    t = fmaf (t, a, -1.62790680e+0f);  // -0x1.a0be80p+0
    t = t * a;
    return raw_ex2 (t);
}

dscerutti · August 13, 2023, 4:14pm

Thanks for getting back to me. The fasterfc() I have from your contributions many years ago folds in an extra factor alpha, so it is solving erfc(alpha * x), but even still the series of ten constants used looks remarkably similar (they are pre-multiplied by the corresponding powers of alpha, computed in double precision and then converted to float). I bet I could optimize the ULPs a bit if I have time. The critical range in x is on the order of [0, 12) and the useful values of alpha are about [0.3, 0.5].

I will check up on the IP issues, but we use the MIT license in our software, quite compatible with your two-clause BSD license.

njuffa · August 14, 2023, 4:46pm

It seems likely that the old fast_erfc() code I found was tuned to minimize absolute error. Here are additional implementations tuned to minimize relative error.

function     input       polyn.  max. absol.  max. relat.  max. ulp
name         domain      degree  error        error        error 
------------+-----------+-------+------------+------------+---------
fasterfc_2   [0, 4]         8    2.32194e-7   9.85422e-6    160.3582
fasterfc_3   [0, 4]         9    1.26963e-7   3.56880e-6     54.6531
fasterfc_4   [0, 8]         9    1.44147e-6   2.67342e-4   4395.3957 
fasterfc_5   [0, 8]        10    5.90040e-7   1.61373e-4   2606.9118
fasterfc_6   [0, 8]       3,3    1.03834e-6   1.27386e-4   2093.4753
fasterfc_7   [0, 8]       4,3    2.00354e-7   2.24757e-5    360.4341
fasterfc_8   [0, 9.19455] 4,4    1.25624e-7   2.54512e-5    376.0001
fasterfc_9   [0, 9.19455] 3,3    9.49246e-6   1.32030e-5    216.1361

Since the approximations use the raw MUFU.EX2 instruction which has flush-to-zero behavior, [0, 9.19455] is the widest interval that makes sense for float computation, because erfc (9.19455) < 2^-126, so underflows to zero. Within the constraints of the chosen algorithm, the most accurate approximation for each interval is near optimal. The rational core approximations in particular have an error of around 3.5 ulp, and this error is magnified through exponentiation by about 95x for the interval [0, 8] and about 125x for the interval [0, 9.19455].

[Code below updated 8/17/2023, 8/21/2023, 8/24/2023, 8/30/2023, 10/20/2023, 12/19/2023, 1/23/2024]

/*
  Copyright (c) 2023-2024, Norbert Juffa

  Redistribution and use in source and binary forms, with or without 
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright 
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  "AS IS" AND ANY ERFCRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

__forceinline__ __device__ float raw_ex2 (float a)
{
    float r;
    asm ("ex2.approx.ftz.f32 %0,%1;" : "=f"(r) : "f"(a));
    return r;
}

__forceinline__ __device__ float raw_rcp (float a)
{
    float r;
    asm ("rcp.approx.ftz.f32 %0,%1;" : "=f"(r) : "f"(a));
    return r;
}

/* Approximate erfc(x) on [0, 4] with maximum absolute error of 2.32194e-7,
   maximum relative error of 9.85422e-6, and maximum ulp error of 160.3582.
*/
__forceinline__ __device__ float fasterfc_2 (float a) 
{
    float t;
    t =              3.53250653e-6f;  //  0x1.da2000p-19
    t = fmaf (t, a, -6.21843719e-5f); // -0x1.04d1f6p-14
    t = fmaf (t, a,  4.33897338e-4f); //  0x1.c6f96ep-12
    t = fmaf (t, a, -1.29289564e-3f); // -0x1.52ecc2p-10
    t = fmaf (t, a, -1.19574741e-3f); // -0x1.397540p-10
    t = fmaf (t, a,  2.86198277e-2f); //  0x1.d4e842p-6
    t = fmaf (t, a, -1.48608550e-1f); // -0x1.3059aep-3
    t = fmaf (t, a, -9.18406487e-1f); // -0x1.d63960p-1
    t = fmaf (t, a, -6.27907574e-1f); // -0x1.417d1ap-1
    t = fmaf (t, a, -a);
    return raw_ex2 (t);
}

/* Approximate erfc(x) on [0, 4] with maximum absolute error of 1.26963e-7,
   maximum relative error of 3.56880e-6, and maximum ulp error of 54.65311.
*/
__forceinline__ __device__ float fasterfc_3 (float a) 
{
    float t;
    t =             -8.75908881e-7f;  // -0x1.d64000p-21
    t = fmaf (t, a,  1.85479294e-5f); //  0x1.372ec0p-16
    t = fmaf (t, a, -1.68654078e-4f); // -0x1.61b178p-13
    t = fmaf (t, a,  8.37384723e-4f); //  0x1.b707e0p-11
    t = fmaf (t, a, -2.17419676e-3f); // -0x1.1cf9f0p-9
    t = fmaf (t, a, -8.09702251e-5f); // -0x1.539d1ep-14
    t = fmaf (t, a,  2.78421026e-2f); //  0x1.c82a3ep-6
    t = fmaf (t, a, -1.48342669e-1f); // -0x1.2fce48p-3
    t = fmaf (t, a, -9.18440282e-1f); // -0x1.d63dcep-1 
    t = fmaf (t, a, -0.62790716e+0f); //  0x1.417d0cp-1
    t = fmaf (t, a, -a);
    return raw_ex2 (t);
}

/* Approximate erfc(x) on [0, 8] with maximum absolute error of 1.44147e-6,
   maximum relative error of 2.67342e-4, and maximum ulp error of 4395.396.
*/
__forceinline__ __device__ float fasterfc_4 (float a) 
{
    float t;
    t =             -4.10509529e-8f;  // -0x1.60a000p-25
    t = fmaf (t, a,  1.43723912e-6f); //  0x1.81ce52p-20
    t = fmaf (t, a, -2.00503782e-5f); // -0x1.5063b8p-16
    t = fmaf (t, a,  1.30545755e-4f); //  0x1.11c638p-13
    t = fmaf (t, a, -1.91030107e-4f); // -0x1.909e82p-13
    t = fmaf (t, a, -3.36878025e-3f); // -0x1.b98d82p-9
    t = fmaf (t, a,  3.08938008e-2f); //  0x1.fa29fep-6
    t = fmaf (t, a, -1.49739161e-1f); // -0x1.32aa72p-3
    t = fmaf (t, a, -9.18201983e-1f); // -0x1.d61e92p-1
    t = fmaf (t, a, -6.27913177e-1f); // -0x1.417dd6p-1
    t = fmaf (t, a, -a);
    return raw_ex2 (t);
}

/* Approximate erfc(x) on [0, 8] with maximum absolute error of 5.90040e-7,
   maximum relative error of 1.61373e-4, and maximum ulp error of 2606.912.
*/
__forceinline__ __device__ float fasterfc_5 (float a) 
{
    float t;
    t =              1.01827027e-8f;  //  0x1.5de000p-27
    t = fmaf (t, a, -4.20768572e-7f); // -0x1.c3cbfcp-22
    t = fmaf (t, a,  7.39652114e-6f); //  0x1.f05f44p-18
    t = fmaf (t, a, -7.12081019e-5f); // -0x1.2aab1ep-14
    t = fmaf (t, a,  3.91975424e-4f); //  0x1.9b041ap-12
    t = fmaf (t, a, -1.00240123e-3f); // -0x1.06c602p-10
    t = fmaf (t, a, -1.87209458e-3f); // -0x1.eac224p-10
    t = fmaf (t, a,  2.93564573e-2f); //  0x1.e0f9e8p-6
    t = fmaf (t, a, -1.48969084e-1f); // -0x1.3116b4p-3
    t = fmaf (t, a, -9.18343842e-1f); // -0x1.d6312ap-1
    t = fmaf (t, a, -6.27909422e-1f); // -0x1.417d58p-1
    t = fmaf (t, a, -a);
    return raw_ex2 (t);
}

/* Approximate erfc(x) on [0, 8] with maximum absolute error of 1.03834e-6,
   maximum relative error of 1.27386e-4, and maximum ulp error of 2093.475.
*/
__forceinline__ __device__ float fasterfc_6 (float a)
{
    float p, q;
    q =             -4.31835651e-5f;  // -0x1.6a4000p-15
    q = fmaf (q, a,  1.43829882e-1f); //  0x1.269048p-3
    q = fmaf (q, a,  6.66979790e-1f); //  0x1.557e60p-1
    q = fmaf (q, a,  1.00000000e+0f); //  0x1.000000p+0
    p =             -2.05620244e-1f;  // -0x1.a51c3ap-3 
    p = fmaf (p, a, -9.93970633e-1f); // -0x1.fce9b8p-1
    p = fmaf (p, a, -2.00440097e+0f); // -0x1.009036p+1
    p = fmaf (p, a, -6.27902687e-1f); // -0x1.417c76p-1
    p = fmaf (p, a, -a);
    q = raw_rcp (q);
    return raw_ex2 (p * q);
}

/* Approximate erfc(x) on [0, 8] with maximum absolute error of 2.00354e-7,
   maximum relative error of 2.24757e-5, and maximum ulp error of 360.4340.
*/
__forceinline__ __device__ float fasterfc_7 (float a)
{
    float p, q;
    q =              3.59107554e-2f;  //  0x1.262e50p-5
    q = fmaf (q, a,  2.75154710e-1f); //  0x1.19c228p-2
    q = fmaf (q, a,  7.82261491e-1f); //  0x1.908494p-1
    q = fmaf (q, a,  1.00000000e+0f); //  0x1.000000p+0
    p =             -5.17663509e-2f;  // -0x1.a811e8p-5
    p = fmaf (p, a, -3.99674177e-1f); // -0x1.994430p-2
    p = fmaf (p, a, -1.31461883e+0f); // -0x1.508adcp+0
    p = fmaf (p, a, -2.19190407e+0f); // -0x1.189050p+1
    p = fmaf (p, a, -6.27906919e-1f); // -0x1.417d04p-1
    p = fmaf (p, a, -a);
    q = raw_rcp (q);
    return raw_ex2 (p * q);
}

/* Approximate erfc(x) on [0, 9.194558] with maximum abs. error of 1.25624e-7,
   maximum relative error of 2.54512-5, and maximum ulp error of 376.0001.
*/
__forceinline__ __device__ float fasterfc_8 (float a)
{
    float p, q;
    q =             -1.14660520e-6f;  // -0x1.33ca1cp-20
    q = fmaf (q, a,  4.20600772e-2f); //  0x1.588e60p-5
    q = fmaf (q, a,  3.01798075e-1f); //  0x1.350a8ep-2
    q = fmaf (q, a,  8.17630708e-1f); //  0x1.a2a07ep-1
    q = fmaf (q, a,  1.00000000e+0f); //  0x1.000000p+0
    p =             -6.05805367e-2f;  // -0x1.f04698p-5
    p = fmaf (p, a, -4.39263731e-1f); // -0x1.c1ce5ap-2
    p = fmaf (p, a, -1.39052355e+0f); // -0x1.63f95ap+0
    p = fmaf (p, a, -2.24947500e+0f); // -0x1.1feeccp+1
    p = fmaf (p, a, -6.27906859e-1f); // -0x1.417d02p-1
    p = fmaf (p, a, -a);
    q = raw_rcp (q);
    return raw_ex2 (p * q);
}

/* Approximate erfc(x) on [0, 9.194558] with maximum abs. error of 9.49246e-6,
   maximum relative error of 1.32030e-5, and maximum ulp error of 216.1361.
*/
__forceinline__ __device__ float fasterfc_9 (float a)
{
    float p, q, s;
    p =             -4.98443842e-5f;  // -0x1.a22000p-15
    p = fmaf (p, a,  2.12357581e-1f); //  0x1.b2e888p-3
    p = fmaf (p, a,  7.15674877e-1f); //  0x1.6e6cf0p-1
    p = fmaf (p, a,  1.00000000e+0f); //  0x1.000000p+0
    q =              3.74178112e-1f;  //  0x1.7f288cp-2
    q = fmaf (q, a,  1.29051912e+0f); //  0x1.4a5f76p+0
    q = fmaf (q, a,  1.84437013e+0f); //  0x1.d828a4p+0
    q = fmaf (q, a,  1.00000000e+0f); //  0x1.000000p+0
    q = raw_rcp (q);
    p = p * q;
    s = -a * a;
    return p * raw_ex2 (fmaf (s, 4.42695041e-1f, s));
}

system · August 28, 2023, 4:46pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Optimized version of single-precision error function, erff() CUDA Programming and Performance	21	4656	December 25, 2017
An accuracy-optimized performance-competitive implementation of erfcf() CUDA Programming and Performance	0	749	August 2, 2022
Passing textures to a __global__ function CUDA Programming and Performance	24	3447	June 2, 2017
An accuracy-optimized performance-competitive implementation of `erfcf` GPU-Accelerated Libraries	1	539	August 2, 2022
Accuracy-optimized performance-neutral implementation of erfcxf() CUDA Programming and Performance	2	471	September 4, 2022
Larger than expected / documented error in erfcinvf() CUDA Programming and Performance	7	606	February 13, 2020
Fastmath functions Speed or accuracy CUDA Programming and Performance	8	21637	April 16, 2009
Strange behavior of cosf function (possible bug ?) CUDA Programming and Performance	13	2230	March 6, 2013
Accuracy in GPU floating point calculations CUDA Programming and Performance	35	8436	September 9, 2011
Accuracy-optimized implementation of erfcinvf(), without performance impact CUDA Programming and Performance	0	270	December 2, 2023

Calling all Juffas! What's up with erfcf() nowadays?

Related topics