Conversion of double to floats: how to convert one double to two floats for use on single-precision GPUs


I have only a 32-bit GPU (GF8600GT) and I want to do numerical analysis on it. However, I need to carry out most of my operations in 64-bit (double) precision.

Short description of the problem:
It is mathematically possible to take a double D and represent it as the sum of two floats Fb and Fs (a big and a small float). I should be able to do this without any loss of precision; I want D - Fb - Fs == 0. The problem is that when the compiler typecasts, it rounds instead of truncating.

Fb = (float) D;
Fs = (float) ( D - (double) Fb );
cout << D - Fb - Fs << endl;

does not work; there is some error. Unless the exponent of the double is too large (and it is not in this case, all my numbers are between 0 and 10), the double should be splittable.

Does anyone know how to typecast with truncation instead of rounding?

What I am trying to achieve at the end:
I am using a GMRES solver with a preconditioner, and the largest share of time in my code is spent computing the action of the preconditioner on a vector. What I want to do is convert the preconditioner matrix P to a float matrix and load it into GPU memory. Then, every time I need to compute its action on a vector, I move the vector to the GPU, do the computation, and take the result back (I can do that with CUBLAS). Since the original vector is double precision, I want to split it into two float vectors and perform the matrix-vector product once for each. That is twice as slow, but it should still be a net gain in performance, since the GPU is more than twice as fast as the CPU.

You should read this about floating point numbers:

I don’t think it’s possible to do any floating point calculation without losing a little precision. There is always a small amount of error because of the limitations of the floating point representation.

Also, I don’t think it’s always possible to represent any double as the sum of two floats. Doubles have a bigger range because they use more exponent bits. Supposing the exponent bits weren’t an issue, you could use bit operations and access the mantissa of the double to do what you want.

I’m not sure, but the newer version of CUBLAS might support doubles? Or you could buy a compute capability 1.3 CUDA card.

GMRES works well in mixed-precision approaches if done right, so either run DP all the way or read up on mixed-precision schemes. Or don’t, because you gave the answer yourself: precondition in lower precision. I’ve spent quite some time on this for multigrid and simple Krylov subspace solvers; the trick is to identify where exactly you need high precision…

There will be loss of precision in the preconditioner, but preconditioners don’t have to be exact anyway.

Theoretically, the only problem with representing the double as floats arises with very large exponents. As long as the exponents work out, it is possible to “split” the double.

My issue was with the compiler/C++ code. After hours of research it seems there might be no clean way of typecasting without rounding.

I tried some things and there was a significant error between the fully DP GMRES and the mixed one that I used. Either I am not doing something right, or my problem is unsuitable for this type of method (probably the former). Dominik G, do you know of a paper that discusses mixed methods? (I will try to find one myself, but if you know one off the top of your head I would appreciate it.)

What operations does preconditioning entail? (Just addition?)

I’ve seen a couple of libraries on the forums that do extended-precision arithmetic with two floats. Do a Google search of the forum.

What exactly is your question about type conversion? Do you mean double to float on the x86 host? I think you have to flip a flag to enable “round toward zero.” You can also do it via casting to int and bit manipulation. Another thing you might look into is fixed-point arithmetic (i.e., integers with an implied decimal point), since all your floating-point numbers are actually in a specific range.

The preconditioner is a factorized block-diagonal matrix (with rather large blocks). To compute Px I need to solve several systems Lx = b, where L is lower triangular. It involves more than addition and is hence more complicated.
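Each of those triangular solves is a forward substitution; a rough single-threaded sketch (the GPU version would batch one such solve per diagonal block):

```cpp
#include <vector>

// Sketch: solve L x = b for one block, where L is lower triangular
// (row-major, n x n, nonzero diagonal), by forward substitution.
std::vector<float> forward_solve(const std::vector<float>& L,
                                 const std::vector<float>& b, int n) {
    std::vector<float> x(n);
    for (int i = 0; i < n; ++i) {
        float s = b[i];
        for (int j = 0; j < i; ++j)
            s -= L[i * n + j] * x[j];   // subtract already-known terms
        x[i] = s / L[i * n + i];        // divide by the diagonal entry
    }
    return x;
}
```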

Example of my conversion problem: suppose the double uses a 6-bit mantissa and the float only a 3-bit one (the digits below are binary; ignore the exponent for the purpose of the argument).

D = 1.10101E+0

Obviously we can represent D = Fb + Fs as:

Fb = 1.10E+0

Fs = 1.01E-3

My problem is that Fb = (float) D returns Fb = 1.11E+0. Typecasting seems to round numbers rather than truncate them, and I want it to truncate (AMD X2 64, gcc).

Bitwise operations would work, but they could potentially be too slow. I may try them at some point if nothing else works.

Fixed point might work if I use a long int and a regular int.

OK, there is another way to do this. Thanks to Dominik G, I looked for mixed precision and here is what I got.

Suppose we have a system of equations Ax = b. Suppose that there is no error in converting A from double to single and back (this works for preconditioners, which don’t have to be exact anyway). b is in double precision.

Convert A and b to single precision and compute x_1 = inverse(A) * b (i.e., solve the system cheaply using floats).

Take x_1, convert it to doubles ( x = (double) x_1 ) and compute the residual r = b - A*x in doubles.

Convert A and r to single precision and compute x_2 = inverse(A) * r.

Take x = x_1 + x_2 (in doubles).

The idea is that each step gains more precision in the solution. This seems to work for my problem (at least on the test that I used): 2 iterations converged sufficiently for GMRES. I have yet to code this with CUBLAS and test the performance difference (hopefully I will see an improvement).

You might also be interested in the “double-single” representation: