quads in cuda?

vibrantcascade · July 18, 2011, 2:07pm

I’m doing scientific calculations which require 128 bit precision and after doing a number of searches I haven’t been able to find anything and was wondering. Is cuda capable of handling quad precision?

tera · July 18, 2011, 3:55pm

Not natively. But with [font=“Courier New”]uint4[/font] there is a 128 bit data type (with native load+store operations) on which you could define your own operations.

njuffa · July 18, 2011, 6:07pm

CUDA does not currently have native support for floating-point operands and operation with more than double precision accuracy.

Would it be possible to expand on your requirements a bit? What are you currently using for your host-based computations? Do you need IEEE-754 compliant quadruple precision, or would double-double arithmetic suffice? What operations do you need to perform in the higher precision? Just addition, multiplication, division, and square root or also operation beyond this? Is the higher precision required throughput the application or just in particular parts of it? For example, it only accurate summation is required, this may be solvable by using Kahan summation.

vibrantcascade · July 19, 2011, 3:41pm

The code is written in fortran, and due to lack of 128 bit support in most legacy fortran compilers we have alternate code paths which use the double-double workaround. (But it’s much nicer to be able to declare quads or real*16.) Most of the numbers in the code are doubles, but quads are being used to prevent loss of precision in 2 areas with some rather intense mathematical formulas. I need to be able to do all basic math operations as well as squares and square roots on the quads.

Basically the code is a group of 6 nested infinite sums, with 60 different possible cases for the final calculation depending on the value you have at that point.

I’d like to be able to run say the first 3 loops on the CPU, then create a couple hundred threads which are given the values at the end of loop 3, and have the billions of values in the 3 loops below that done on the GPU by having it execute the loops and evaluate the numbers based upon the 60 different case statements.

So at this point I’m just investigating whether I could package up the 3 inside nested infinite sums and the 60 different functions called via case statements and send them to the GPU, or if I would have to collapse all the code into a single function to make this work, as well at the limitations on how much code I can send to the GPU.

akavo · July 19, 2011, 3:56pm

If it is possible that you can compute the terms of the infinite sums independently and use parallel reductions to sum them up.
Parallel reductions help immensely with round off error because at each step floating point numbers of similar magnitude are added.

njuffa · July 19, 2011, 8:41pm

Since you are using double-double on the host, why not continue to do so on the GPU? An IEEE-754 quadruple precision implementation would be pretty slow, I would estimate throughput at about 1/100 of native double precision. Double-double is going to be quite a bit faster. Below is code I have used for reference implementations of solvers and other codes, on Linux and Windows. The double-double operands are stored in double2 (a built-in CUDA type), with the least-significant part (“tail”) stored in the x-component, and the most-significant part (“head”) stored in the y-component. The GPU hardware provides 128-bit loads and stores that are used for double2, which has a 16-byte alignment requirement.

A few notes regarding the functions below. The multiplication, division, and square root are based on 2-digit longhand arithmetic, taking advantage of FMA (fused muliply-add). You may wonder why the addition requires so many operations. Most people use double-double addition based straight on the algorithm in T. J. Dekker’s original paper [1], but that addition suffers from a loss of accuracy when the magnitude of the operands is within a factor of two and their signs differ. Subtractive cancellation however is exactly the kind of situation where I would want the additional bits provided by double-double arithmetic to be put to good use. So I am using an algorithm published in a recent paper which provides full accuracy throughout. Eyeballing it, it appears to be based on Kahan’s work on accurate summation.

I have experimentally established the following error bounds (these are not proven error bounds!):

addition        rel. err < 2**-104

multiplication  rel. err < 2**-104

division        rel. err < 2**-103

square root     rel. err < 2**-104

If you relax the error bound for the multiplication to 2**-103, you can omit the least significant partial product (tail x tail) in the multiplication code (see comment).

I am making no claims that the code below is optimal in any sense, but I believe this is a reasonable adaptation of double-double to the GPU and it has served me well in my own work.

[1] T.J. Dekker: A Floating-Point Technique for Extending the Available Precision. Numer. Math. 18, 224-242 (1971)

/* Based on: Andrew Thall, Extended-Precision Floating-Point Numbers for GPU 

   Computation. Retrieved from http://andrewthall.org/papers/df64_qf128.pdf

   on 7/12/2011.

*/

__device__ double2 dbldbladd (double2 a, double2 b)

{

    double2 z;

    double t1, t2, t3, t4, t5, e;

t1 = a.y + b.y;

    t2 = t1 - a.y;

    t3 = (a.y - (t1 - t2)) + (b.y - t2);

    t4 = a.x + b.x;

    t2 = t4 - a.x;

    t5 = (a.x - (t4 - t2)) + (b.x - t2);

    t3 = t3 + t4;

    t4 = t1 + t3;

    t3 = t3 + (t1 - t4);

    t3 = t3 + t5;

    z.y = e = t4 + t3;

    z.x = t3 + (e - t4);

    return z;

}

/* Take full advantage of FMA. Only 8 DP operations */

__device__ double2 dbldblmul (double2 x, double2 y)

{

    double e;

    double2 t, z;

    t.y = __dmul_rn (x.y, y.y);     /* prevent FMA-merging */

    t.x = __fma_rn (x.y, y.y, -t.y);

    t.x = __fma_rn (x.x, y.x, t.x); /* remove this for even higher performance */

    t.x = __fma_rn (x.y, y.x, t.x);

    t.x = __fma_rn (x.x, y.y, t.x);

    z.y = e = t.y + t.x;

    z.x = (t.y - e) + t.x;

    return z;

}

/* Take full advantage of the FMA (fused-multiply add) capability of the GPU.

   Straighforward long-hand division that treats each double as a "digit".

 */

__device__ double2 dbldbldiv (double2 a, double2 b)

{

    double2 z;

    double c, cc, up;

c = a.y / b.y;     /* compute most significant quotient "digit" */

    cc = fma (b.y, -c, a.y); 

    cc = fma (b.x, -c, cc);

    cc = cc + a.x;     /* remainder after most significant quotient "digit" */

    cc = cc / b.y;     /* compute least significant quotient "digit" */

    z.y = up = c + cc;

    z.x = (c - up) + cc;

    return z;

}

/* Based on the binomial theorem. x = (a+b)^2 = a*a + 2*a*2b + b*b. If |b| is

   much smaller than |a|, then x ~= a*a + 2*a*b, thus b = (x - a*a) / (2*a).

*/

__device__ double2 dbldblsqrt (double2 x)

{

    double e;

    double2 t, z;

    t.y = sqrt (x.y);

    t.x = __fma_rn (t.y, -t.y, x.y);

    t.x = t.x + x.x;

    t.x = t.x / (t.y + t.y);

    z.y = e = t.y + t.x;

    z.x = (t.y - e) + t.x;

    return z;

}

vibrantcascade · August 2, 2011, 9:34pm

I see what I was thinking of as double-double was slightly different from your example. I basically stored 128-bit numbers into a string and worked them in 2 chunks by casting the halves into doubles and storing over carryover bits using my own math functions. So it’s essentially the same thing, although this is likely much faster without using strings and casting between types.

So as far as setting the double-double equal to the actual value can I simply set it equal to the whole number and it’ll implicitly split it? Or would I have to split the number myself and set the *.x and *.y components of the double2 myself?

For example, I use code like this to test the precision of a particular compiler and see where I lose the first digit. Would this essentially work?

double2 a, b, c

a = 2.01234567890123456789012345678901234567890
b = 1.01234567890123456789012345678901234567890
c= dbldbladd(a, (b*-1))

write a, b, c

And thanks for all the help!

mickaelg · August 3, 2011, 1:20pm

You could look to that source code GQD. The double-double is ported to the gpu.
http://code.google.com/p/gpuprec/

MickaÃ«l

Topic		Replies	Views
CPU and GPU Floating point anomaly CUDA Programming and Performance	10	5668	November 10, 2013
precision CUDA Programming and Performance	3	2615	December 16, 2008
error when trying to use half (fp16) CUDA Programming and Performance	16	20047	October 13, 2015
Is there a difference between GPU double precision and CPU double precision? CUDA Programming and Performance	14	10754	November 26, 2009
Maximum native precision or mixed-precision to have 21 digit floating point type? CUDA Programming and Performance mixed-precision	1	675	January 14, 2022
not sure where to ask this but CUDA can only handle 4 byte and 8 byte floating point? CUDA Programming and Performance	10	800	July 6, 2018
Floating Point Accuracy CUDA Programming and Performance	11	30432	April 6, 2013
Double double precision arithmetic library now available CUDA Programming and Performance	14	8447	July 2, 2013
floating point precision on CUDA CUDA Programming and Performance	11	14849	June 8, 2010
GPU Code and CPU Code output not matching till machine precision (i.e. 13 decimals places) CUDA Programming and Performance	22	829	August 9, 2023

quads in cuda?

Related topics