For both CUDA 8.0 and CUDA 10.2, section E.1 of the CUDA Programming Guide states the maximum error of erfcinvf() (vs. the mathematical result rounded to single precision) as 2 ulp.

However, in following up on some observations, I found that the maximum error of this function in CUDA 8.0 seems to be 4 ulp, not 2 ulp. The minimal standalone reproducer below demonstrates the *worst* case I found; it is not the only case with larger-than-expected error.

I don’t have CUDA 10.2 installed on my machine and would be much obliged if someone could run the reproducer with CUDA 10.2 to either confirm that the issue still exists currently, or that it has been fixed in the meantime.

```
#include <stdio.h>
#include <stdlib.h>   // EXIT_SUCCESS
#include <stdint.h>
#include <string.h>   // memcpy

// Compare single-precision erfcinvf() against the double-precision
// erfcinv() result rounded to single precision.
__global__ void kernel (float a)
{
    float res = erfcinvf (a);
    double ref = erfcinv ((double)a);
    float reff = (float)ref;   // reference rounded to single precision
    printf ("arg= %23.16e %15.6a (%08x) \n"
            "res= %23.16e %15.6a (%08x) <<<<\n"
            "ref= %23.16e %22.14a \n"
            "reff=%23.16e %15.6a (%08x) <<<<\n",
            a, a, __float_as_int (a),
            res, res, __float_as_int (res),
            ref, ref,
            reff, reff, __float_as_int (reff));
}

int main (void)
{
    int iarg = 0x3757c618;   // bit pattern of the worst-case argument found
    float arg;
    memcpy (&arg, &iarg, sizeof arg);
    kernel<<<1,1>>>(arg);
    cudaDeviceSynchronize();
    return EXIT_SUCCESS;
}
```

I built the above code with

```
nvcc -o max_erfcinvf_error.exe -arch=sm_61 max_erfcinvf_error.cu
```

The output of the program on my machine (CUDA 8, Quadro P2000) looks as follows:

```
arg= 1.2861120922025293e-05 0x1.af8c30p-17 (3757c618)
res= 3.0847203731536865e+00 0x1.8ad81ep+1 (40456c0f) <<<<
ref= 3.0847213161387415e+00 0x1.8ad825e90b8430p+1
reff= 3.0847213268280029e+00 0x1.8ad826p+1 (40456c13) <<<<
```

Note the final hexadecimal number in each of the marked lines: these are the bit patterns of the single-precision result of erfcinvf() and of the reference result correctly rounded to single precision, respectively. The difference (0x40456c13 - 0x40456c0f) is 4 ulp (the actual difference of erfcinvf() vs. the mathematical result here is 3.95517 ulp).
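For anyone who wants to double-check the two ulp figures on the host, here is a minimal sketch of how they can be computed. The helper names (`float_from_bits`, `float_bits`, `ulp_distance`, `ulp_error`) are made up for illustration and are not part of CUDA. The sketch assumes finite operands of the same sign for `ulp_distance`, and a finite, nonzero reference within the normal single-precision range for `ulp_error`. It measures the error in units of the single-precision ulp of the reference's binade, which is one common convention rather than the only possible one.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Build a float from its 32-bit pattern (hypothetical helper).
static float float_from_bits(uint32_t u)
{
    float x;
    std::memcpy(&x, &u, sizeof x);
    return x;
}

// Reinterpret the bits of a float as a 32-bit unsigned integer.
static uint32_t float_bits(float x)
{
    uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    return u;
}

// Number of representable single-precision steps between a and b.
// Assumption: both are finite and have the same sign.
static uint32_t ulp_distance(float a, float b)
{
    uint32_t ua = float_bits(a);
    uint32_t ub = float_bits(b);
    return (ua > ub) ? (ua - ub) : (ub - ua);
}

// Error of a float result against a double reference, in units of the
// single-precision ulp of the reference's binade.
// Assumption: ref is finite, nonzero, and in the normal float range.
static double ulp_error(float res, double ref)
{
    int expo;
    std::frexp(ref, &expo);                   // |ref| = m * 2^expo, 0.5 <= m < 1
    double ulp = std::ldexp(1.0, expo - 24);  // float ulp in that binade
    return std::fabs((double)res - ref) / ulp;
}
```

With the values from the output above, `ulp_distance(float_from_bits(0x40456c0f), float_from_bits(0x40456c13))` gives 4, and `ulp_error(float_from_bits(0x40456c0f), 3.0847213161387415)` gives approximately 3.955, which agrees with the figures quoted.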