I took the double emulation from the Mandelbrot example and made some adjustments to it. I converted the functions to operators that work with float2 and I added unary negation, division, and some extra functions to store/retrieve values in/from a double single float2. I put them together into a header file in case it’s useful for anyone else.
EDIT: I updated the header file. I made the existing * and / operators into “fast” functions and used the more precise * (and equivalent /) operator from the Mandelbrot example.
EDIT: I updated the file again, removing the questionable “fast” functions and using the new multiply from Norbert Juffa posted here in the thread.
EDIT: Another update to include Reimar’s bit-style splitting and to switch all multiplications to __fmul_rn to ensure no MAD instructions appear in generated PTX. dsmath.h (5.7 KB)
Keep in mind, however, that, as hinted in this thread, there is currently no GPU that supports the fused multiply-add (FMA) operation in single precision (single-precision MAD on G80 and even GT200 is a very different beast), so the “fast” multiplication and division will not work on current GPUs.
It is not a matter of precision. Both versions of the multiplication have the same accuracy, but the “fast” version is uh… faster when an FMA is present, and broken otherwise.
Additionally, NVCC detects the subexpression a.x * b.x in code like the following and computes it only once, which breaks the whole algorithm anyway:
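Roughly, the pattern in question looks like this (a sketch written with plain floats for clarity, not the snippet from the post):

__device__ float prod_err (float a, float b)
{
    float p   = a * b;       /* rounded product */
    float err = a * b - p;   /* intended to compile to an FMA that recovers p's rounding error */
    /* Without a true FMA, or once a * b is folded into a single common
       subexpression, this is just p - p and err is always 0. */
    return err;
}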
Note that CUDA 2.0 includes some new intrinsic functions, __fadd_rn() and __fmul_rn(), that can be used to prevent the compiler from generating MADs.
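For example (a small sketch, not from the header), writing a product and the following addition with these intrinsics keeps them from being contracted into a MAD:

__device__ float mul_add_no_mad (float a, float b, float c)
{
    float p = __fmul_rn(a, b);   /* correctly rounded single-precision multiply */
    return __fadd_rn(p, c);      /* correctly rounded add; never merged into a MAD */
}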
Below are some validated double-single add and multiply functions written by Norbert Juffa here at NVIDIA.
Note that in general double-single is an imperfect substitute for true double precision, and we encourage customers who want the highest precision to use GPUs that support doubles in hardware (currently Tesla C1060 and GeForce GTX 260 / 280).
/* For x and y of opposite sign whose magnitude is within a factor of two
   of each other either variant below loses accuracy. Otherwise the result
   is within 1.5 ulps of the correctly rounded result with 48-bit mantissa.
   Based on the dsadd() function in D.H. Bailey's dsfun package from netlib. */
__device__ float2 dblsgl_add (float2 x, float2 y)
{
    float2 z;
#if defined(__DEVICE_EMULATION__)
    volatile float t1, t2, e;   /* volatile forces intermediates back to 32-bit floats in emulation */
#else
    float t1, t2, e;
#endif
    t1 = x.y + y.y;
    e = t1 - x.y;
    t2 = ((y.y - e) + (x.y - (t1 - e))) + x.x + y.x;

    /* The result is t1 + t2, after normalization */
    z.y = e = t1 + t2;
    z.x = t2 - (e - t1);
    return z;
}
/* Based on: Guillaume Da Graça, David Defour. Implementation of Float-Float
 * Operators on Graphics Hardware. RNC'7, pp. 23-32, 2006.
 */
__device__ float2 dblsgl_mul (float2 x, float2 y)
{
    float2 z;
#if defined(__DEVICE_EMULATION__)
    volatile float up, vp, u1, u2, v1, v2, mh, ml;
#else
    float up, vp, u1, u2, v1, v2, mh, ml;
#endif
    /* Split the high-order words into ~12-bit halves (4097 = 2^12 + 1) */
    up = x.y * 4097.0f;
    u1 = (x.y - up) + up;
    u2 = x.y - u1;
    vp = y.y * 4097.0f;
    v1 = (y.y - vp) + vp;
    v2 = y.y - v1;
    /* Exact product of the high-order words is mh + ml */
    mh = __fmul_rn(x.y,y.y);
    ml = (((u1 * v1 - mh) + u1 * v2) + u2 * v1) + u2 * v2;
    /* Add the cross products of high- and low-order words */
    ml = (__fmul_rn(x.y,y.x) + __fmul_rn(x.x,y.y)) + ml;
    /* Normalize the result */
    z.y = up = mh + ml;
    z.x = (mh - up) + ml;
    return z;
}
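As a usage sketch (not from the original post), assuming x and y already hold double-single values with the high-order word in .y as in the functions above, computing x*x + y is simply:

__device__ float2 square_plus_sketch (float2 x, float2 y)
{
    return dblsgl_add (dblsgl_mul (x, x), y);   /* (x * x) + y in double-single arithmetic */
}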
My understanding is that there is an FMAD instruction, but it is less accurate because it truncates the product before the addition. In any case, I’ve removed the fast functions.
The addition function is exactly the same as the one I had and is an exact copy of the dsadd function (with the exception of the high and low floats being swapped). The multiplication function is new to me and I’ve included it in the header file, albeit preserving the word order I had from DSFUN. I hope that’s OK.
I can understand that this is preferable, but there’s still occasion to use the double single. For my applications, most of the calculations are fine with single precision, but one equation is more sensitive and needs just a little bit more precision. The double single represents an easy and acceptable way to achieve this on existing hardware.
Any arithmetic performed with a double promotes the other arguments to double as well. The idea is that a.x gets the first 23 bits from b’s mantissa by simply truncating b to a float. Then, in (b - a.x), a.x’s value is promoted to a double, and the promoted value is subtracted from b’s value. The resulting value should be equivalent to b’s value with the first 23 bits of b’s mantissa cleared. The remaining bits should then be left-shifted by 23 (with an appropriate change to the exponent). The result is then truncated to a float and stored in a.y. This should then store the second 23 bits from b’s mantissa in a.y.
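A minimal host-side sketch of that split, using the same word order as the description above (high-order float in .x, low-order in .y; the function name is illustrative):

float2 pack_double (double b)
{
    float2 a;
    a.x = (float) b;                    /* leading bits of b's mantissa */
    a.y = (float) (b - (double) a.x);   /* the remainder, rounded into the low-order float */
    return a;
}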
I’ve been trying to write a sqrt function with emulated double precision, and while it works in emulation mode, it gives the wrong result when running on the hardware. The code is:
// This subroutine employs the following formula (due to Alan Karp):
// sqrt(d) = (d * x) + 0.5 * [d - (d * x)^2] * x (approx.)
// where x is a single precision approximation to the reciprocal square root,
// and where the multiplications d * x and [d - (d * x)^2] * x are performed
// with only single precision.
__device__ float2 dblsgl_sqrt (float2 f2)   /* wrapper signature assumed; only the body was posted */
{
    if (f2.x == 0) return f2;               /* assumed early-out: avoid rsqrtf(0) */

    float t1, t2, t3;
    float2 s0, s1;
    t1 = rsqrtf(f2.x);
    t2 = __fmul_rn(f2.x, t1);
    s0.x = t2 * t2;
    s0.y = t2 * t2 - s0.x; // PROBLEM HERE
    s1 = f2 - s0;
    t3 = float(0.5) * s1.x * t1;
    s0.x = t2;
    s0.y = 0;
    s1.x = t3;
    s1.y = 0;
    return s0 + s1;
}
The problem occurs at “PROBLEM HERE”, in that this value is always 0 when not running in emulation mode. The .ptx file up to that part is:
When you port routines over from DSFUN, make sure that you use the parts that don’t assume the presence of a FMAD instruction. Furthermore, my experiments indicate that using the multiplication operator instead of __fmul_rn has a fairly negligible effect on the accuracy of the double single operations.
I’ve ported most of the double single routines over to CUDA now, but it’s not clear to me how that works with licensing (DSFUN’s license is rather vague) so I haven’t posted the results here.
So just changing that one line (s0.y = t2 * t2 - s0.x;) so that it does the two operations using emulated double precision (putting t2 and s0.x into the higher-order parts of new double-single variables and leaving the lower-order parts zero), then saving the higher-order part of that result in s0.y, totally works. Thanks, dudes.
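In code, the fix described above looks roughly like this (a sketch using the overloaded float2 operators from the header, with the high-order word in .x as in the function above):

    float2 a, b;
    a.x = t2;    a.y = 0.0f;    /* t2 as a double-single value, low-order part zero */
    b.x = s0.x;  b.y = 0.0f;    /* likewise for s0.x */
    s0.y = (a * a - b).x;       /* t2*t2 - s0.x in double-single; keep the high-order part */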
StickGuy, I need your help (sorry for my English).
I am trying to use your header file on a trivial example.
#include "mex.h"
#include "dsmath.h"   /* the double-single float2 operators from the header above */

/* Kernel */
__global__ void square_elements(float2* in, float2* out, int N)
{
    int idx = blockIdx.x*blockDim.x+threadIdx.x;
    if ( idx < N) out[idx]=in[idx]*in[idx];
}

void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, const mxArray *prhs[])
{
    int i, j, m, n;
    double *data1, *data2;
    float2 *data1f, *data2f;
    /* on GPU */
    float2 *data1f_gpu, *data2f_gpu;
    mxClassID category;

    if (nrhs != nlhs)
        mexErrMsgTxt("The number of input and output arguments must be the same.");

    for (i = 0; i < nrhs; i++)
    {
        /* Find the dimensions of the data */
        m = mxGetM(prhs[i]);
        n = mxGetN(prhs[i]);
        /* Create an mxArray for the output data */
        plhs[i] = mxCreateDoubleMatrix(m, n, mxREAL);
        /* Create an input and output data array on the GPU */
        cudaMalloc( (void **) &data1f_gpu,sizeof(float2)*m*n);
        cudaMalloc( (void **) &data2f_gpu,sizeof(float2)*m*n);
        /* Retrieve the input data */
        data1 = mxGetPr(prhs[i]);
        /* Check if the input array is single or double precision */
        category = mxGetClassID(prhs[i]);
        if( category == mxSINGLE_CLASS)
        {
            /* The input array is single precision, it can be sent directly to the card */
            cudaMemcpy( data1f_gpu, data1, sizeof(float)*m*n, cudaMemcpyHostToDevice);
        }
        if( category == mxDOUBLE_CLASS)
        {
            /* The input array is in double precision, it needs to be converted to
               floats before being sent to the card */
            data1f = (float2 *) mxMalloc(sizeof(float2)*m*n);
            for (j = 0; j < m*n; j++)
            {
                /* assumed conversion: split each double into a high/low float pair */
                data1f[j].x = (float) data1[j];
                data1f[j].y = (float) (data1[j] - (double) data1f[j].x);
            }
            cudaMemcpy( data1f_gpu, data1f, sizeof(float)*n*m, cudaMemcpyHostToDevice);
        }
        data2f = (float2 *) mxMalloc(sizeof(float2)*m*n);
        /* Compute execution configuration using 128 threads per block */
        dim3 dimBlock(128);
        dim3 dimGrid((m*n)/dimBlock.x);
        if ( (n*m) % 128 !=0 ) dimGrid.x+=1;
        /* Call function on GPU */
        square_elements<<<dimGrid,dimBlock>>>(data1f_gpu, data2f_gpu, n*m);
        /* Copy result back to host */
        cudaMemcpy( data2f, data2f_gpu, sizeof(float)*n*m, cudaMemcpyDeviceToHost);
        /* Create a pointer to the output data */
        data2 = mxGetPr(plhs[i]);
        /* Convert from single to double before returning */
        for (j = 0; j < m*n; j++)
            data2[j] = data2f[j].x + data2f[j].y;
        /* Clean-up memory on device and host */
        if (category == mxDOUBLE_CLASS) mxFree(data1f);
        mxFree(data2f);
        cudaFree(data1f_gpu);
        cudaFree(data2f_gpu);
    }
}
When I use type float2 instead of float, I receive almost identical results.
I would be very grateful if you could point out to me the reason for this problem.
I’ve been using the double precision emulation with success in complex vector and matrix algebra routines.
My question is: Can this computation principle be extended to three, four, or five floats to get something like an arbitrary-precision maths implementation? I’d love to create some Mandelbrot (or 3D Mandelbulb) code that dynamically adds more and more floats (precision!) as you zoom in - but I have no idea how to attack this.