Conversion of double to floats: how to convert one double to two floats for use in single-precision GPU computations

Hi,

I have only a 32-bit GPU (GF8600GT) and I want to do numerical analysis on it. However, I need to carry out most of my operations in 64-bit (double) precision.

Short description of the problem:
It is mathematically possible to take a double D and represent it as the sum of two floats Fb and Fs (a big and a small float). I should be able to do that without any loss of precision; I want D - Fb - Fs == 0. The problem is that when the compiler typecasts, it rounds instead of truncating.

Fb = (float) D;
Fs = (float) ( D - (double) Fb );
cout << D - Fb - Fs << endl;

does not work; there is some error. Unless the exponent of the double is too large (and it is not in this case; all my numbers are between 0 and 10), it should be possible to split the double.

Does anyone know how to typecast with truncation (round toward zero) instead of rounding?

What I am trying to achieve at the end:
I am using a GMRES solver with a preconditioner, and the largest amount of time spent in my code is in computing the action of the preconditioner on a vector. What I want to do is convert the preconditioner matrix P to a float matrix and load it into GPU memory. Then, every time I need to compute its action on a vector, I move the vector to the GPU, do the computations, and take the result back (I can do that with cublas). Since the original vector is double precision, I want to split it into two float vectors and perform the matrix-vector product once for each vector. That is two times slower, but should still result in a performance gain since the GPU is more than two times faster than the CPU. A rough sketch of this plan follows.
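Here is a CPU-side sketch of that plan; the plain loops stand in for the two cublasSgemv calls, and the function names are mine, not cublas API:

#include <cstddef>
#include <vector>

// Split each double entry into a big float and a small float remainder.
void split(const std::vector<double>& v,
           std::vector<float>& vb, std::vector<float>& vs) {
    for (std::size_t i = 0; i < v.size(); ++i) {
        vb[i] = static_cast<float>(v[i]);                              // big part
        vs[i] = static_cast<float>(v[i] - static_cast<double>(vb[i])); // small remainder
    }
}

// Apply the single-precision matrix P (n x n, row-major) to a double
// vector: two float matrix-vector products, recombined in double.
std::vector<double> apply_P(const std::vector<float>& P,
                            const std::vector<double>& v, int n) {
    std::vector<float> vb(n), vs(n), yb(n, 0.0f), ys(n, 0.0f);
    split(v, vb, vs);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            yb[i] += P[i * n + j] * vb[j];   // first SGEMV-like pass
            ys[i] += P[i * n + j] * vs[j];   // second SGEMV-like pass
        }
    std::vector<double> y(n);
    for (int i = 0; i < n; ++i)
        y[i] = static_cast<double>(yb[i]) + static_cast<double>(ys[i]);
    return y;
}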

You should read this about floating point numbers: http://docs.sun.com/source/806-3568/ncg_goldberg.html

I don’t think it’s possible to do any floating point calculation without losing a little precision. There is always a small amount of error because of the limitations of the floating point representation.

Also, I don’t think it’s always possible to represent any double as the sum of two floats. Doubles have a bigger range because they use more exponent bits (and a much wider mantissa). Supposing the exponent bits weren’t an issue, you could use bit operations and access the mantissa of the double to do what you want; see the sketch after the links below.
Single-precision floating-point format: http://en.wikipedia.org/wiki/Single_precision
Double-precision floating-point format: http://en.wikipedia.org/wiki/Double_precision
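For what it’s worth, here is a minimal sketch of the bit-manipulation idea, assuming the exponent is in float range (the function name is made up): zero out the low 29 of the double’s 52 explicit mantissa bits, so the subsequent cast has nothing left to round.

#include <cstdint>
#include <cstring>

// A float keeps 23 explicit mantissa bits, a double 52, so truncating to
// float precision means chopping the low 52 - 23 = 29 bits.
float truncate_to_float(double d) {
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);      // reinterpret without UB
    bits &= ~((std::uint64_t{1} << 29) - 1);  // clear the low 29 mantissa bits
    std::memcpy(&d, &bits, sizeof d);
    return static_cast<float>(d);             // now exact: no rounding occurs
}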

I’m not sure, but the newer version of cublas might support doubles? See http://forums.nvidia.com/lofiversion/index.php?t76614.html. Or you could buy a compute capability 1.3 CUDA card.

GMRES works well in mixed-precision approaches if done right, so either run DP all the way or read up on mixed-precision schemes. Or don’t, because you gave the answer yourself: precondition in lower precision. I’ve spent quite some time on this for multigrid and simple Krylov subspace solvers; the trick is to identify where exactly you need high precision…

There will be loss of precision in the preconditioner, but preconditioners don’t have to be exact anyway.

Theoretically, the only problem with representing the double as floats can happen with very large exponents. As long as the exponents work, it should be possible to “split” the double.

My issue was with the compiler/C++ code. After hours of research, it seems that there might be no clean way of typecasting without rounding.

I tried some things, and there was significant error between the fully DP GMRES and the mixed one that I used. Either I am not doing something right, or my problem is unsuitable for this type of method (probably the former). Dominik G, do you know of a paper that discusses mixed methods? (I will try to find one myself, but if you know one off the top of your head, I would appreciate it.)

What operations does preconditioning entail? (Just addition?)

I’ve seen a couple of libraries on the forums that do extended-precision arithmetic with two floats. Do a Google search of the forum (site:forums.nvidia.com).

What exactly is your question about type conversion? Do you mean double to float on the x86 host? I think you have to flip a flag to enable “round toward zero”; see the sketch below. You can also do it via casting to int and bit manipulations. Another thing you might look into is fixed-point arithmetic (i.e. integers with an implied decimal point), since all your floating-point numbers are in a specific range.
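The flag in question is the FPU/SSE rounding mode, reachable portably through <cfenv>. A minimal sketch, untested on your exact setup (compile without -ffast-math, ideally with -frounding-math, so the mode change is respected):

#include <cfenv>
#include <iostream>

int main() {
    volatile double d = 0.1;                // volatile: prevent constant folding
    std::fesetround(FE_TOWARDZERO);         // casts now chop instead of round
    float fb = static_cast<float>(d);
    float fs = static_cast<float>(d - static_cast<double>(fb));
    std::fesetround(FE_TONEAREST);          // restore the default mode
    std::cout << d - fb - fs << std::endl;  // residual of the split
    return 0;
}

Note that even with truncation, two 24-bit mantissas cover at most 48 of the double’s 53 mantissa bits, so the printed residual can still be nonzero.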

The preconditioner is a factorized block-diagonal matrix (with rather large blocks). To compute Px I need to solve several systems Lx = b, where L is lower triangular. It involves more than addition and is hence more complicated; see the forward-substitution sketch below.
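For context, here is a minimal forward-substitution sketch for one such solve Lx = b, with a dense lower triangular L stored row-major (names are illustrative only):

#include <vector>

std::vector<float> solve_lower(const std::vector<float>& L,
                               const std::vector<float>& b, int n) {
    std::vector<float> x(n);
    for (int i = 0; i < n; ++i) {
        float s = b[i];
        for (int j = 0; j < i; ++j)
            s -= L[i * n + j] * x[j];  // subtract already-solved components
        x[i] = s / L[i * n + i];       // divide by the diagonal entry
    }
    return x;
}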

Example of my conversion problem: suppose double uses a 6-bit mantissa and float has only 3 bits (ignore the exponent; for the purpose of the argument, assume it is decimal).

D = 1.10101E+0

Obviously we can represent D = Fb + Fs as:

Fb = 1.10E+0

Fs = 1.01E-3

My problem is that Fb = (float) D returns Fb = 1.11E+0. Typecasting seems to round numbers rather than truncate them, and I want it to truncate (AMD X2 64, gcc).

Bitwise operations would work, but they will potentially be too slow. I may try them at some point if nothing else works.

Fixed point might work if I use a long int and a regular int; a rough sketch is below.
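Something like this, assuming all values stay in [0, 10) so that 59 fractional bits fit in a signed 64-bit integer (the scale factor is a guess, not something I have tested):

#include <cstdint>

// 4 integer bits plus 59 fractional bits fit in a signed 64-bit integer,
// giving more fractional precision than a double's 52-bit mantissa.
const int SHIFT = 59;

std::int64_t to_fixed(double d) {
    return static_cast<std::int64_t>(d * (std::int64_t{1} << SHIFT));
}

double from_fixed(std::int64_t f) {
    return static_cast<double>(f) / (std::int64_t{1} << SHIFT);
}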

OK, there is another way to do this. Thanks to Dominik G, I looked into mixed precision, and here is what I got.

Suppose we have a system of equations Ax = b. Suppose the error in converting A from double to single precision is acceptable (this works for preconditioners, which don’t have to be exact anyway). b is in double precision.

Convert A and b to single precision and compute x_1 = inverse(A) * b (i.e. solve the system cheaply using floats).

Take x_1, convert it to doubles (x = (double) x_1) and compute the residual r = b - A*x in doubles.

Convert A and r to single precision and compute x_2 = inverse(A) * r.

Take x = x_1 + x_2 (in doubles).

The idea is that on every step we gain more precision in the solution. This seems to work for my problem (at least on the test that I used); two iterations converged sufficiently for GMRES. I have yet to code this with cublas and test the performance difference (hopefully I will see an improvement). A sketch of these steps is below.
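Here is what those steps look like in code; solve_single is a stand-in for the cheap single-precision solve (the cublas part), passed in because its details depend on the preconditioner:

#include <functional>
#include <vector>

std::vector<double> refine(
    const std::vector<double>& A,  // n x n, row-major, double precision
    const std::vector<double>& b,
    int n,
    const std::function<std::vector<float>(const std::vector<float>&,
                                           const std::vector<float>&)>& solve_single) {
    // Convert A and b to single precision.
    std::vector<float> Af(A.begin(), A.end()), bf(b.begin(), b.end());

    // Step 1: cheap single-precision solve, x_1 ~ inverse(A) * b.
    std::vector<float> x1 = solve_single(Af, bf);
    std::vector<double> x(x1.begin(), x1.end());

    // Step 2: residual r = b - A*x, computed in doubles.
    std::vector<float> rf(n);
    for (int i = 0; i < n; ++i) {
        double r = b[i];
        for (int j = 0; j < n; ++j)
            r -= A[i * n + j] * x[j];
        rf[i] = static_cast<float>(r);  // convert the residual to single
    }

    // Step 3: correction x_2 ~ inverse(A) * r, then x = x_1 + x_2 in doubles.
    std::vector<float> x2 = solve_single(Af, rf);
    for (int i = 0; i < n; ++i)
        x[i] += static_cast<double>(x2[i]);
    return x;
}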

You might also be interested in the “double-single” representation:
http://crd.lbl.gov/~dhbailey/mpdist/index.html
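For reference, the core of that representation is an error-free float addition. A minimal sketch in the spirit of those packages (names are mine, not their API; compile without -ffast-math so the cancellations survive):

// A "double-single" value: the unevaluated sum hi + lo of two floats,
// carrying roughly 48 mantissa bits on float-only hardware.
struct dsfloat { float hi, lo; };

dsfloat ds_add(dsfloat a, dsfloat b) {
    float s = a.hi + b.hi;                    // rounded sum of the high parts
    float v = s - a.hi;
    float e = (a.hi - (s - v)) + (b.hi - v);  // exact rounding error (Knuth's two-sum)
    e += a.lo + b.lo;                         // fold in the low parts
    float hi = s + e;                         // renormalize the pair
    return { hi, e - (hi - s) };
}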

http://forums.nvidia.com/index.php?showtopic=73067