Larger than expected / documented error in erfcinvf()

For both CUDA 8.0 and CUDA 10.2, section E.1. of the CUDA Programing Guide states the maximum error of erfcinvf() (vs. the mathematical result rounded to single precision) as:

However, in following up on some observations, I found that the maximum error of this function in CUDA 8.0 seems to be 4 ulp, not 2 ulp. Below I am showing a minimal standalone reproducer that shows the worst case I found; this is not the only case of larger than expected error.

I don’t have CUDA 10.2 installed on my machine and would be much obliged if someone could run the reproducer with CUDA 10.2 to either confirm that the issue still exists currently, or that it has been fixed in the meantime.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

__global__ void kernel (float a)
{
    float res = erfcinvf (a);
    double ref = erfcinv ((double)a);
    float reff = (float)ref;
    printf ("arg= %23.16e %15.6a (%08x) \n"
            "res= %23.16e %15.6a (%08x) <<<<\n"
            "ref= %23.16e %22.14a       \n"
            "reff=%23.16e %15.6a (%08x) <<<<\n",
            a, a, __float_as_int (a), 
            res, res, __float_as_int (res), 
            ref, ref, 
            reff, ref, __float_as_int (reff));
}

int main (void)
{
    int iarg = 0x3757c618;
    float arg;
    memcpy (&arg, &iarg, sizeof arg);
    kernel<<<1,1>>>(arg);
    cudaDeviceSynchronize();
    return EXIT_SUCCESS;
}

I built the above code with

nvcc -o max_erfcinvf_error.exe -arch=sm_61 max_erfcinvf_error.cu

The output of the program on my machine (CUDA 8, Quadro P2000) looks as follows:

arg=  1.2861120922025293e-05  0x1.af8c30p-17 (3757c618)
res=  3.0847203731536865e+00   0x1.8ad81ep+1 (40456c0f) <<<<
ref=  3.0847213161387415e+00  0x1.8ad825e90b8430p+1
reff= 3.0847213268280029e+00   0x1.8ad826p+1 (40456c13) <<<<

Note the final number in each of the marked lines. These are the single-precision results of erfcinvf() and the reference result correctly rounded to single precision, respectively. The difference (0x40456c13 - 0x40456c0f) is 4 ulps (the actual difference of erfcinvf() vs. the mathematical result here is 3.95517 ulp).

With CUDA 10.2 Linux on a Titan Xp I get the same results.

nvcc --version
nvcc: NVIDIA ® Cuda compiler driver
Copyright © 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89

./erfcinvf
arg= 1.2861120922025293e-05 0x1.af8c30p-17 (3757c618)
res= 3.0847203731536865e+00 0x1.8ad81ep+1 (40456c0f) <<<<
ref= 3.0847213161387415e+00 0x1.8ad825e90b8430p+1
reff= 3.0847213268280029e+00 0x1.8ad826p+1 (40456c13) <<<<

Thank you for running the reproducer and confirming the issue (presumably a documentation issue) still exists. I guess that means I should file a bug now …

NVIDIA’s bug database keeps rejecting my bug report submission, even after I reduced it to a minimal stub: “An error occurred while processing your request.” Giving up.

There is maintenance going on for the bugs database right now. It was originally supposed to be done by now, and it was not supposed to create any outages. Unfortunately something has gone wrong and both of those statements are no longer correct.

At the moment the statement from IT is as follows:

“Extending the outage till 02 Feb, 4:00 AM PST due to NVBugs …”

Apologies for the hassle. I’m also unable to access nvbugs internally at the moment.

You may wish to try again tomorrow, or if I see no activity here I will file a bug to cover this issue as time permits after the system becomes available.

Thanks for reporting.

Thanks for the bug database status update. I will try again later.

In the meantime, I have looked more closely at the polynomial that performs the computation giving rise to the observed worst-case error, and I am quite sure that this is just a documentation issue, possibly a cut & paste error from the neighboring entry for erfinvf()).

The polynomial utilized by CUDA for erfinvf() uses the following coefficients (extracted from the binary):

coeff[0] = af8a6370 -2.51727084e-10 -0x1.14c6e0p-32
coeff[1] = 3221f645  9.42742862e-09  0x1.43ec8ap-27
coeff[2] = b4016fda -1.20547526e-07 -0x1.02dfb4p-23
coeff[3] = 3468f846  2.16970051e-07  0x1.d1f08cp-23
coeff[4] = 370742aa  8.06214848e-06  0x1.0e8554p-17
coeff[5] = b804db4d -3.16754922e-05 -0x1.09b69ap-15
coeff[6] = ba4afea1 -7.74363114e-04 -0x1.95fd42p-11
coeff[7] = 3bb5c027  5.54658799e-03  0x1.6b804ep-8
coeff[8] = 3e24ae0f  1.60820231e-01  0x1.495c1ep-3
coeff[9] = 3f62dfc4  8.86226892e-01  0x1.c5bf88p-1

If I replace those with coefficients computed by myself a while ago (https://devtalk.nvidia.com/default/topic/1036239/a-very-compact-implementation-of-the-inverse-error-function-erfinvf-/), the reproducer posted above still shows the same amount of error (likely triggered by the use of __log2f() in the argument transformation in a classical trade-off of accuracy vs performance). The alternative set of coefficients I tried:

-2.00998329e-10 -0x1.ba0000p-33
 7.63788321e-09  0x1.066f88p-27
-9.44120231e-08 -0x1.957f1ep-24
 1.27279494e-08  0x1.b5543ap-27
 8.98369126e-06  0x1.2d7152p-17
-3.40910883e-05 -0x1.1dfa0ep-15
-7.70850747e-04 -0x1.9425d6p-11
 5.54407062e-03  0x1.6b5612p-8
 1.60820886e-01  0x1.495c76p-3
 8.86226892e-01  0x1.c5bf88p-1

Bug 2836258 filed. In post #6 I was actually looking at the wrong one of two polynomials used in the computation of erfcinvf(). The correct set responsible is:

coeff2[0] = c27c73f1 -6.31132240e+01  -0x1.f8e7e2p+5
coeff2[1] = 42fef829  1.27484688e+02   0x1.fdf052p+6
coeff2[2] = c2e4361c -1.14105682e+02  -0x1.c86c38p+6
coeff2[3] = 42714d9b  6.03257866e+01   0x1.e29b36p+5
coeff2[4] = c1ae51b3 -2.17898922e+01  -0x1.5ca366p+4
coeff2[5] = 40cef504  6.46740913e+00   0x1.9dea08p+2
coeff2[6] = bfea9e05 -1.83294737e+00  -0x1.d53c0ap+0
coeff2[7] = bcf871f4 -3.03277746e-02  -0x1.f0e3e8p-6
coeff2[8] = 3f553775  8.32877457e-01   0x1.aa6eeap-1

I was able to tweak this for a maximum error of 3.809635 ulps, but that still means the documented bound would be stated as 4 ulps, i.e. yet more confirmation that this is a documentation bug.

-6.31093750e+1 -0x1.f8e000p+5
 1.27482948e+2  0x1.fdee8ap+6
-1.14106033e+2 -0x1.c86c94p+6
 6.03259277e+1  0x1.e29b80p+5
-2.17898617e+1 -0x1.5ca346p+4
 6.46740246e+0  0x1.9de9ecp+2
-1.83294737e+0 -0x1.d53c0ap+0
-3.03277746e-2 -0x1.f0e3e8p-6
 8.32877457e-1  0x1.aa6eeap-1

Thanks for reporting. The issue has been confirmed internally and should be fixed in a future documentation update.