Correctly rounded rsqrt in double precision?

Are there any plans to provide a correctly rounded (up to 1/2 ulp in round-to-nearest mode for all the valid arguments) rsqrt(double) function in CUDA math library anytime soon?

Alternatively, does anyone know of a third-party implementation?

Thanks, Vedran

The existing double-precision rsqrt(), while not always delivering a correctly rounded result, has a maximum error of less than 0.52 ulps compared to the mathematical reference, and (from memory) about 98% of results are correctly rounded. Could you give some details about the use case that requires a correctly rounded rsqrt() implementation?

You may want to consider filing an enhancement request using the bug reporting form linked from the registered developer website. If you do, please prefix the synopsis with “RFE:” to make it readily recognizable as a feature request rather than a bug report.

The use case:

When computing (a very special kind of) plane rotation, there are a handful of mathematically equivalent, but numerically different, sets of formulas to choose from.

The formula set which, after extensive testing on a CPU, showed the best numerical orthogonality of the computed rotations is the one that employs invsqrt (the Intel math library’s correctly rounded rsqrt – at least, that’s what Intel claims). The tests were carefully designed, using strict IEEE double arithmetic + FMA.
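Just to give the flavour of such formulas (a hypothetical sketch only, not the actual formula set used in the algorithm): the cosine comes from a reciprocal square root of 1 + t*t, and the sine is derived from it via the tangent t.

#include <math.h>

/* Hypothetical illustration only: build the cosine and sine of a rotation
   from the tangent t, with the cosine obtained via a reciprocal square root.
   On the CPU, invsqrt() would take the place of 1.0 / sqrt(...). */
static void rotation_from_tangent (double t, double *c, double *s)
{
    double r = 1.0 / sqrt (fma (t, t, 1.0));  /* 1 / sqrt(1 + t*t) */
    *c = r;
    *s = t * r;                               /* t / sqrt(1 + t*t) */
}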

In my algorithm there are, at most, a few billion rotations computed and applied to the input data.
However, if the same formulas are implemented on a (Fermi) GPU, using double intrinsics to exactly match the CPU code, with only invsqrt changed to rsqrt, the relative accuracy of the output drops by about half an order of magnitude (e.g., from 1e-14 to 5e-13), or even more. In other words, it behaves similarly to the less numerically orthogonal cases.

So, even beyond the exact reproducibility of the results, the last bit missed here and there really does make a difference.

Excellent, that’s exactly what I’ll do!

Many thanks for a fast and helpful answer,

Vedran

I am curious, are these rotations Givens rotations by any chance?

I take it you have carefully excluded all other possible reasons for the numerical discrepancies, leaving only rsqrt()? For example, if you were to use FMA on the GPU and plain FMUL/FADD on the CPU, that could lead to rather large differences. For a quick belt-and-suspenders check you could use compiler flags in addition to intrinsics: if you compile the CPU code with -fp-model precise, and the GPU code with -fmad=false, does that affect the numerical differences?

I am wondering: Are you concerned mostly with getting the most accurate results possible, or primarily that the computations match between CPU and GPU? If it is the former, you may want to check which of the computations (GPU or CPU) delivers the more accurate answer by comparing with an extended-precision computation, e.g. by using double-double arithmetic. I use that approach extensively in my own work.
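For illustration, the standard FMA-based building blocks for such a double-double check look roughly like this (a sketch only; a full comparison also needs the higher-level double-double add/mul operations and normalization):

#include <math.h>

/* Sketch of the standard double-double building blocks; hi + lo is the
   unevaluated sum of two doubles, carrying roughly 106 bits of precision. */
typedef struct { double hi, lo; } dbldbl;

/* product of two doubles together with its rounding error, via FMA */
static dbldbl two_prod (double a, double b)
{
    dbldbl r;
    r.hi = a * b;
    r.lo = fma (a, b, -r.hi);            /* exact: a*b = r.hi + r.lo */
    return r;
}

/* sum of two doubles together with its rounding error (Knuth's TwoSum) */
static dbldbl two_sum (double a, double b)
{
    dbldbl r;
    double t;
    r.hi = a + b;
    t    = r.hi - a;
    r.lo = (a - (r.hi - t)) + (b - t);   /* exact: a+b = r.hi + r.lo */
    return r;
}

Carrying the critical part of the computation in such pairs and then comparing only the high parts against the CPU and GPU results usually shows quickly which of the two is closer to the truth.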

Could you point me to relevant information about the error bounds of invsqrt()? I am looking at my Intel Parallel Studio XE 2013 SP1 installation but seem to be unable to find that information. Am I correct to assume that you are using VML’s vdInvSqrt() in VML_HA mode?

No (but a close guess!), these are Jacobi rotations.

I did everything in my power, except writing assembly code by hand.

I use the Intel-provided C99 fma(double) function for FMA (called from Fortran via an ISO_C_BINDING interface), since my CPU is not Haswell and has no native FMA support. Provided the emulated fma() works as specified by IEEE 754-2008, I suppose it should exactly match the __fma_rn intrinsic.
Also, the compiler was forbidden to flush denormals to zero, change division into multiplication by a reciprocal, etc.
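As a quick sanity check that the emulated fma() really fuses, I can run something like this sketch: squaring 1 + 2^-52 produces a nonzero rounding error that only a true fused multiply-add recovers.

#include <math.h>
#include <float.h>
#include <stdio.h>

/* Sketch: for x = 1 + 2^-52 the exact square is 1 + 2^-51 + 2^-104, which
   rounds to 1 + 2^-51; a correct fma() must recover the discarded 2^-104,
   while an unfused emulation (round(x*x) + c) would return 0. */
int main (void)
{
    double x = 1.0 + DBL_EPSILON;
    double p = x * x;            /* rounded product */
    double e = fma (x, x, -p);   /* rounding error of x*x */
    printf ("fma() appears %s\n", (e != 0.0) ? "fused" : "NOT fused");
    return 0;
}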

I want to get the most accurate result possible, but without resorting to extended precision (I’m quite concerned with speed, as well). Exact match between CPU and GPU is an added bonus that makes a comparison of the CPU and GPU implementations possible, in terms of accuracy.

(However, just a side note: the numerical orthogonality of the formula sets is carefully checked on the CPU in quadruple (128-bit) precision. The computed rotations (sines and cosines) in double are cast to REAL(16), and then the orthogonality test is performed. I really want to trust the results before jumping to a conclusion.)
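The test itself is trivial; sketched in C with GCC's __float128 in place of the actual Fortran REAL(16) code, and for the trigonometric case only, it amounts to:

#include <quadmath.h>

/* Sketch (trigonometric case): how far a computed (cosine, sine) pair,
   stored in double, deviates from c^2 + s^2 = 1 when the test itself is
   evaluated in quadruple precision.  Link with -lquadmath under gcc. */
static __float128 orthogonality_defect (double c, double s)
{
    __float128 cq = (__float128) c;
    __float128 sq = (__float128) s;
    return fabsq (cq * cq + sq * sq - 1.0Q);
}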

This is the best I can find:
http://software.intel.com/sites/products/documentation/doclib/mkl_sa/111/vml/functions/_accuracyall.html
HA means High Accuracy, and if I recall correctly, these are the functions called when the fp model is conservative enough (as in my case).
The document above speaks about the Vector Math Library, but if Intel knows how to compute invsqrt accurately for each element of a vector, presumably a scalar call does the same.

P.S. Seems we fell out of sync somewhere, so just to answer:

No, just invsqrt, declared in the mathimf.h header, and a piece of Fortran that calls it without any extra work:

interface c_invsqrt
  function dinvsqrt(x) bind(c,name='invsqrt')
    use, intrinsic :: iso_c_binding
    real(c_double) :: dinvsqrt
    real(c_double), value :: x
  end function dinvsqrt
end interface c_invsqrt

According to the referenced page, the accuracy of double-precision invsqrt is platform dependent. If I read the data correctly, it states that in HA mode the accuracy is 0.50 ulps on a Westmere CPU [middle set of results], and 0.52 ulps on a SandyBridge CPU. So on the latter, invsqrt would not be correctly rounded.

While it is entirely possible that the small differences between rsqrt() and invsqrt() could be the root cause of the numerical discrepancies you observe, I think it is worth double-checking alternative root causes. I suggested a comparison with a double-double computation purely as a numerical check, not something you would want to deploy in production. Another quick experiment would be to replace the use of FMA with plain adds and multiplies (i.e., the __dadd_rn() and __dmul_rn() intrinsics in CUDA). This would eliminate FMA as a possible source of discrepancies.
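Concretely, something along these lines (a sketch; swapping one device function for the other isolates FMA as a source of differences):

/* Sketch: the same expression evaluated with and without FMA */
__device__ double with_fma (double x, double y, double z)
{
    return __fma_rn (x, y, z);               /* single rounding */
}

__device__ double without_fma (double x, double y, double z)
{
    return __dadd_rn (__dmul_rn (x, y), z);  /* two roundings, like plain FMUL/FADD */
}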

I have some prototype code for a correctly rounded double-precision rsqrt() somewhere. Let me check whether that is in a suitable state to be posted here.

Well, the situation is far worse than that.
I do have a Westmere (Xeon E5620), with the latest Intel toolset, so I believed everything they said.
Also, the only logical assumption, when there is a vector processing function (vdInvSqrt) and a scalar one (invsqrt), would be that the two produce the same outputs.

To verify that, I generated about 33 million double numbers, uniformly distributed in (0,1), and:
a) Computed 1/sqrt(x) in REAL(16).
b) Just to be sure that Intel’s quadruple implementation isn’t flawed, ran MPFR’s rec_sqrt with absurdly high precision (4096 bits).
c) Compared the results from a) and b), and they are bit-identical.
d) Tested:

  1. rsqrt on a GPU (S2050),
  2. invsqrt on a CPU, and
  3. vdInvSqrt on a CPU,

and compared each of them with the reference from a)/b); a sketch of the per-element check is shown below.
All three results have a couple of thousand examples that are not correctly rounded.
Furthermore, the results are pairwise different, so even the invsqrt == vdInvSqrt assumption fails miserably.
It seems the numerical conclusions have to be re-checked with the “problematic” operations done in quadruple precision.
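For completeness, the per-element check looks roughly like this sketch (simplified; here the MPFR reference is produced directly at 53 bits rather than the 4096 bits mentioned above, which is sufficient for deciding correct rounding):

#include <mpfr.h>

/* Sketch: returns 1 if r equals the correctly rounded double value of
   1/sqrt(x), using MPFR's rec_sqrt as the reference. */
static int is_correctly_rounded_rsqrt (double x, double r)
{
    mpfr_t mx, mr;
    int ok;
    mpfr_init2 (mx, 53);                    /* a double fits exactly */
    mpfr_init2 (mr, 53);
    mpfr_set_d (mx, x, MPFR_RNDN);
    mpfr_rec_sqrt (mr, mx, MPFR_RNDN);      /* correctly rounded 1/sqrt(x) */
    ok = (mpfr_get_d (mr, MPFR_RNDN) == r);
    mpfr_clear (mx);
    mpfr_clear (mr);
    return ok;
}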

You might be right; Intel’s FMA emulation should also be scrutinized.
It seems that taking anything for granted is dangerous.

That would be fantastic!

It seems you are very diligent in your investigation. It will be a couple of days before I get to look into the rsqrt_rn() prototype code. If it turns out that the code is in a “postable” state, it will be neither exhaustively tested nor mathematically proven. It should be sufficient to check the hypothesis that the accuracy of the GPU results is negatively affected by the fact that the rsqrt() function is not correctly rounded.

Here is the prototype of an rsqrt() function implementing IEEE-754 round to nearest-or-even. While this code follows an approach well documented in published literature, and has run through a couple of billion test cases now (both random and pattern based), it has neither been mathematically proven to always deliver a correctly rounded result, nor has this been demonstrated by exhaustive testing. This prototype is also not optimized for performance, but rather constructed as a simple wrapper around CUDA’s existing rsqrt() implementation.

With those caveats in place, I think the code below is suitable to test your working hypothesis that CUDA’s current rsqrt() implementation is responsible for the numerical discrepancies you observed. If you find this hypothesis confirmed, I would suggest filing an enhancement request as I mentioned earlier. It would be useful to note in the RFE that the invsqrt() function delivers correctly rounded results on your platform, and how that fact is significant for your use case.

__device__ __forceinline__ double my_drsqrt_rn (double a)
{
  double y, h, l, e;
  unsigned int ilo, ihi, g, f;
  int d;

  ihi = __double2hiint(a);
  ilo = __double2loint(a);
  if (((unsigned int)ihi) - 0x00100000U < 0x7fe00000U){
    /* positive normal argument: reduce a into [0.5, 2), preserving the
       parity of the exponent; d records the change in the exponent field */
    f = ihi | 0x3fe00000;
    g = f & 0x3fffffff;
    d = g - ihi;
    a = __hiloint2double(g, ilo); 
    /* refine the starting approximation y ~= rsqrt(a) using e = 1 - a*y*y */
    y = rsqrt (a);
    h = __dmul_rn (y, y);
    l = __fma_rn (y, y, -h);
    e = __fma_rn (l, -a, __fma_rn (h, -a, 1.0));
    /* Round as shown in Peter Markstein, "IA-64 and Elementary Functions" */
    y = __fma_rn (__fma_rn (0.375, e, 0.5), e * y, y);
    /* undo the argument reduction: scale the result by 2**(d/2) */
    d = d >> 1;
    a = __hiloint2double(__double2hiint(y) + d, __double2loint(y));
  } else if (a == 0.0) {
    a = __hiloint2double ((ihi & 0x80000000) | 0x7ff00000, 0x00000000); /* +/-0 -> +/-Inf */
  } else if (a < 0.0) {
    a = __hiloint2double (0xfff80000, 0x00000000); /* negative arguments -> NaN */
  } else if (isinf (a)) {
    a = __hiloint2double (ihi & 0x80000000, 0x00000000); /* +Inf -> +0 */
  } else if (isnan (a)) {
    a = a + a; /* quieten NaN arguments */
  } else {
    /* positive subnormal argument: scale up by 2**1022, then scale the
       result back by 2**511 via its exponent field */
    a = a * __hiloint2double (0x7fd00000, 0);
    y = rsqrt (a);
    h = __dmul_rn (y, y);
    l = __fma_rn (y, y, -h);
    e = __fma_rn (l, -a, __fma_rn (h, -a, 1.0));
    /* Round as shown in Peter Markstein, "IA-64 and Elementary Functions" */
    y = __fma_rn (__fma_rn (0.375, e, 0.5), e * y, y);
    a = __hiloint2double(__double2hiint(y) + 0x1ff00000, __double2loint(y));
  }
  return a;
}
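For a quick test, the function can simply be substituted for rsqrt() inside a kernel; a trivial (hypothetical) test kernel might look like this:

/* Hypothetical test kernel: apply the prototype to an array of arguments,
   for comparison against rsqrt() and a high-precision reference on the host. */
__global__ void rsqrt_rn_kernel (const double *x, double *r, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        r[i] = my_drsqrt_rn (x[i]);
    }
}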

The code works quite well!
I actually have two kinds of Jacobi rotations, the hyperbolic and the trigonometric ones.
In one variant of the algorithm only the trigonometric rotations are applied and accumulated, while both are used in the other.
The preliminary results suggest that in both variants the orthogonality of the accumulated rotations has improved with the correctly rounded rsqrt, compared with the “default” one.
In the second variant, with mixed rotations, the orthogonality and the relative accuracy of the final result are now better than or on par with the previously best results, computed without rsqrt.
In the first variant that is not the case, but at least there is an improvement over the old rsqrt-based results.

Even though there is still a lot of work to do before anything conclusive can be said, I am quite convinced of the usefulness of your implementation. Unfortunately, I sent the RFE immediately when you suggested doing so, but I left a link to this discussion, and I feel everything relevant is here.

I’d be happy to acknowledge your help in a paper I’m writing.
If that’s OK with you, please send to my_login_here@fsb.hr how you would like to be credited.

I am pleased to hear that my prototype implementation works well in the context of your use case.

You can update bug reports (including RFEs) as many times as you desire. Since a quantification of the advantages of a proposed enhancement is valuable input to the prioritization process, it would be best if you could append a brief summary of the advantages of the correctly rounded rsqrt() for your use case. Thanks.

I will contact you by personal message (PM) via the forum regarding the acknowledgement.