I’m doing scientific calculations which require 128-bit precision, and after a number of searches I haven’t been able to find anything, so I was wondering: is CUDA capable of handling quad precision?

Not natively. But with `uint4` there is a 128-bit data type (with native load and store operations) on which you could define your own operations.

CUDA does not currently have native support for floating-point operands and operations with more than double-precision accuracy.

Would it be possible to expand on your requirements a bit? What are you currently using for your host-based computations? Do you need IEEE-754 compliant quadruple precision, or would double-double arithmetic suffice? What operations do you need to perform in the higher precision? Just addition, multiplication, division, and square root, or operations beyond these? Is the higher precision required throughout the application or just in particular parts of it? For example, if only accurate summation is required, this may be solvable with Kahan summation.
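For reference, Kahan (compensated) summation carries the rounding error of each addition along in a separate correction term. A minimal host-side C sketch (names are mine):

```c
#include <stddef.h>

/* Compensated (Kahan) summation: c captures the low-order bits lost in
   each addition and feeds them back into the next term. */
double kahan_sum (const double *x, size_t n)
{
    double s = 0.0; /* running sum */
    double c = 0.0; /* running compensation */
    for (size_t i = 0; i < n; i++) {
        double y = x[i] - c; /* corrected next term */
        double t = s + y;    /* low-order bits of y are lost here ... */
        c = (t - s) - y;     /* ... and recovered algebraically */
        s = t;
    }
    return s;
}
```

With this, adding many small terms to a large running sum no longer loses the small terms, which is often all the extra precision a summation-heavy code actually needs.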

The code is written in Fortran, and due to the lack of 128-bit support in most legacy Fortran compilers we have alternate code paths which use the double-double workaround. (But it’s much nicer to be able to declare quads, i.e. real*16.) Most of the numbers in the code are doubles, but quads are used to prevent loss of precision in 2 areas with some rather intense mathematical formulas. I need to be able to do all basic math operations as well as squares and square roots on the quads.

Basically the code is a group of 6 nested infinite sums, with 60 different possible cases for the final calculation depending on the value you have at that point.

I’d like to be able to run, say, the first 3 loops on the CPU, then spawn a couple hundred threads that are given the values at the end of loop 3, and have the GPU handle the billions of values in the 3 loops below that by executing the loops and evaluating the numbers with the 60 different case statements.

So at this point I’m just investigating whether I could package up the 3 inner nested infinite sums and the 60 different functions called via case statements and send them to the GPU, or whether I would have to collapse all the code into a single function to make this work, as well as what the limitations are on how much code I can send to the GPU.

If possible, compute the terms of the infinite sums independently and use parallel reductions to sum them up.

Parallel reductions help immensely with round-off error because at each step floating-point numbers of similar magnitude are added.
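The effect is easy to see on the host. This C sketch of a pairwise (tree) reduction is the serial analogue of what a parallel reduction does across threads (names are mine):

```c
#include <stddef.h>

/* Pairwise (tree) reduction: recursively split the array in half, so each
   addition combines partial sums built from a similar number of terms,
   i.e. operands of similar magnitude. A GPU parallel reduction performs
   the same halving across threads at each step. */
double pairwise_sum (const double *x, size_t n)
{
    if (n == 1) return x[0];
    size_t h = n / 2;
    return pairwise_sum (x, h) + pairwise_sum (x + h, n - h);
}
```

The worst-case error grows only with the depth of the tree, log2(n), rather than with n as in a left-to-right loop.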

Since you are using double-double on the host, why not continue to do so on the GPU? An IEEE-754 quadruple precision implementation would be pretty slow, I would estimate throughput at about 1/100 of native double precision. Double-double is going to be quite a bit faster. Below is code I have used for reference implementations of solvers and other codes, on Linux and Windows. The double-double operands are stored in double2 (a built-in CUDA type), with the least-significant part (“tail”) stored in the x-component, and the most-significant part (“head”) stored in the y-component. The GPU hardware provides 128-bit loads and stores that are used for double2, which has a 16-byte alignment requirement.

A few notes regarding the functions below. The multiplication, division, and square root are based on 2-digit longhand arithmetic, taking advantage of FMA (fused multiply-add). You may wonder why the addition requires so many operations. Most people use double-double addition based directly on the algorithm in T. J. Dekker’s original paper [1], but that addition suffers from a loss of accuracy when the magnitude of the operands is within a factor of two and their signs differ. Subtractive cancellation, however, is exactly the kind of situation where I would want the additional bits provided by double-double arithmetic to be put to good use. So I am using an algorithm published in a recent paper which provides full accuracy throughout. Eyeballing it, it appears to be based on Kahan’s work on accurate summation.

I have experimentally established the following error bounds (these are not *proven* error bounds!):

```
addition rel. err < 2**-104
multiplication rel. err < 2**-104
division rel. err < 2**-103
square root rel. err < 2**-104
```

If you relax the error bound for the multiplication to 2**-103, you can omit the least significant partial product (tail x tail) in the multiplication code (see comment).

I am making no claims that the code below is optimal in any sense, but I believe this is a reasonable adaptation of double-double to the GPU and it has served me well in my own work.

[1] T.J. Dekker: A Floating-Point Technique for Extending the Available Precision. Numer. Math. 18, 224-242 (1971)

```
/* Based on: Andrew Thall, Extended-Precision Floating-Point Numbers for GPU
Computation. Retrieved from http://andrewthall.org/papers/df64_qf128.pdf
on 7/12/2011.
*/
__device__ double2 dbldbladd (double2 a, double2 b)
{
double2 z;
double t1, t2, t3, t4, t5, e;
t1 = a.y + b.y;
t2 = t1 - a.y;
t3 = (a.y - (t1 - t2)) + (b.y - t2);
t4 = a.x + b.x;
t2 = t4 - a.x;
t5 = (a.x - (t4 - t2)) + (b.x - t2);
t3 = t3 + t4;
t4 = t1 + t3;
t3 = t3 + (t1 - t4);
t3 = t3 + t5;
z.y = e = t4 + t3;
z.x = (t4 - e) + t3;
return z;
}
/* Take full advantage of FMA. Only 8 DP operations */
__device__ double2 dbldblmul (double2 x, double2 y)
{
double e;
double2 t, z;
t.y = __dmul_rn (x.y, y.y); /* prevent FMA-merging */
t.x = __fma_rn (x.y, y.y, -t.y);
t.x = __fma_rn (x.x, y.x, t.x); /* remove this for even higher performance */
t.x = __fma_rn (x.y, y.x, t.x);
t.x = __fma_rn (x.x, y.y, t.x);
z.y = e = t.y + t.x;
z.x = (t.y - e) + t.x;
return z;
}
/* Take full advantage of the FMA (fused-multiply add) capability of the GPU.
Straightforward long-hand division that treats each double as a "digit".
*/
__device__ double2 dbldbldiv (double2 a, double2 b)
{
double2 z;
double c, cc, up;
c = a.y / b.y; /* compute most significant quotient "digit" */
cc = __fma_rn (b.y, -c, a.y);
cc = __fma_rn (b.x, -c, cc);
cc = cc + a.x; /* remainder after most significant quotient "digit" */
cc = cc / b.y; /* compute least significant quotient "digit" */
z.y = up = c + cc;
z.x = (c - up) + cc;
return z;
}
/* Based on the binomial theorem. x = (a+b)^2 = a*a + 2*a*b + b*b. If |b| is
much smaller than |a|, then x ~= a*a + 2*a*b, thus b = (x - a*a) / (2*a).
*/
__device__ double2 dbldblsqrt (double2 x)
{
double e;
double2 t, z;
t.y = sqrt (x.y);
t.x = __fma_rn (t.y, -t.y, x.y);
t.x = t.x + x.x;
t.x = t.x / (t.y + t.y);
z.y = e = t.y + t.x;
z.x = (t.y - e) + t.x;
return z;
}
```
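As a quick host-side sanity check, the addition can be transcribed to plain C (a sketch; a struct stands in for double2 with the same tail-in-x, head-in-y convention, and the names are mine). It recovers a tail that plain double arithmetic would drop:

```c
typedef struct { double x, y; } dd; /* x = tail, y = head */

/* Host transcription of the double-double addition for experimentation. */
static dd dd_add (dd a, dd b)
{
    dd z;
    double t1, t2, t3, t4, t5, e;
    t1 = a.y + b.y;                       /* two-sum of the heads */
    t2 = t1 - a.y;
    t3 = (a.y - (t1 - t2)) + (b.y - t2);  /* error of head sum */
    t4 = a.x + b.x;                       /* two-sum of the tails */
    t2 = t4 - a.x;
    t5 = (a.x - (t4 - t2)) + (b.x - t2);  /* error of tail sum */
    t3 = t3 + t4;
    t4 = t1 + t3;
    t3 = t3 + (t1 - t4);
    t3 = t3 + t5;
    z.y = e = t4 + t3;                    /* renormalize into head ... */
    z.x = (t4 - e) + t3;                  /* ... and tail */
    return z;
}
```

For example, (1 + 1e-20) + (-1) yields 1e-20 exactly, whereas in plain double arithmetic the 1e-20 would vanish into the head.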

I see what I was thinking of as double-double was slightly different from your example. I basically stored 128-bit numbers in a string and worked them in 2 chunks by casting the halves into doubles and handling the carry bits with my own math functions. So it’s essentially the same thing, though your approach is likely much faster since it avoids strings and casting between types.

So as far as setting the double-double equal to an actual value: can I simply set it equal to the whole number and have it implicitly split? Or would I have to split the number myself and set the .x and .y components of the double2 myself?

For example, I use code like this to test the precision of a particular compiler and see where I lose the first digit. Would this essentially work?

```
double2 a, b, c

a = 2.01234567890123456789012345678901234567890
b = 1.01234567890123456789012345678901234567890

c = dbldbladd(a, (b * -1))

write a, b, c
```

And thanks for all the help!

You could look at the GQD source code; its double-double arithmetic is ported to the GPU.

http://code.google.com/p/gpuprec/

Mickaël