Here is the prototype of an rsqrt() function implementing IEEE-754 round to nearest-or-even. While this code follows an approach well documented in published literature, and has run through a couple of billion test cases now (both random and pattern based), it has neither been mathematically proven to always deliver a correctly rounded result, nor has this been demonstrated by exhaustive testing. This prototype is also not optimized for performance, but rather constructed as a simple wrapper around CUDA’s existing rsqrt() implementation.
With those caveats in place, I think the code below is suitable to test your working hypothesis that CUDA’s current rsqrt() implementation is reponsible for the numerical discrepencancies you observed. If you find this hypothesis confirmed, I would suggest filing an enhancement request as I mentioned earlier. It would be useful to note in the RFE that the invsqrt() function delivers correctly rounded results on your platform, and how that fact is significant for your use case.
__device__ __forceinline__ double my_drsqrt_rn (double a)
{
double y, h, l, e;
unsigned int ilo, ihi, g, f;
int d;
ihi = __double2hiint(a);
ilo = __double2loint(a);
if (((unsigned int)ihi) - 0x00100000U < 0x7fe00000U){
f = ihi | 0x3fe00000;
g = f & 0x3fffffff;
d = g - ihi;
a = __hiloint2double(g, ilo);
y = rsqrt (a);
h = __dmul_rn (y, y);
l = __fma_rn (y, y, -h);
e = __fma_rn (l, -a, __fma_rn (h, -a, 1.0));
/* Round as shown in Peter Markstein, "IA-64 and Elementary Functions" */
y = __fma_rn (__fma_rn (0.375, e, 0.5), e * y, y);
d = d >> 1;
a = __hiloint2double(__double2hiint(y) + d, __double2loint(y));
} else if (a == 0.0) {
a = __hiloint2double ((ihi & 0x80000000) | 0x7ff00000, 0x00000000);
} else if (a < 0.0) {
a = __hiloint2double (0xfff80000, 0x00000000);
} else if (isinf (a)) {
a = __hiloint2double (ihi & 0x80000000, 0x00000000);
} else if (isnan (a)) {
a = a + a;
} else {
a = a * __hiloint2double (0x7fd00000, 0);
y = rsqrt (a);
h = __dmul_rn (y, y);
l = __fma_rn (y, y, -h);
e = __fma_rn (l, -a, __fma_rn (h, -a, 1.0));
/* Round as shown in Peter Markstein, "IA-64 and Elementary Functions" */
y = __fma_rn (__fma_rn (0.375, e, 0.5), e * y, y);
a = __hiloint2double(__double2hiint(y) + 0x1ff00000,__double2loint(y));
}
return a;
}