Fast float to double conversion

In various circumstances, one comes across a situation where some initial approximation is computed in single precision for performance reasons, then the result is up-converted to double precision for the final computation steps. Unfortunately, the float-to-double conversion instruction provided by the GPU is itself one of the slower 64-bit operations. One particular such context would be a double-precision cbrt() implementation.

I was therefore looking for something faster, in particular for “DP lite” platforms like sm_5x, leading to the code below. Maybe others find this useful as well. This translate to five instructions on sm_3x (3 LOPs, 2 shifts) and four instructions on sm_5x (2 shifts, MOV32I, LOP3). A three-instruction sequence should be possible by changing the mask to 0x00ffffff and using __byte_perm(), but this would really narrow the supported range down drastically, to [0.5,8.0). Use of the bitfield insertion instruction may be worth looking at as well, but my understanding is that BFE and BFI are fairly low-throughput instructions, so I tried sticking to simple operations.

/* works correctly if argument is in +/-[2**-15, 2**17), or zero, infinity, NaN */
__device__ __forceinline__ 
double my_fast_float2double (float a)
{
    unsigned int ia = __float_as_int (a);
    return __hiloint2double ((((ia >> 3) ^ ia) & 0x07ffffff) ^ ia, ia << 29);
}

In the course of programming CUDA particle simulations there is quite a bit of casting to and from 64 bit to 32 bit and back.

As you know for such simulations mixed precision is the new normal, with energy/dose accumulations done in 64 bit, and movement (ray tracing etc) done in 32 bit or less.

This will certainly get used in some optical photon simulations I have been working on recently.

Thanks for posting and the explanation!

I was actually motivated to code this up by a question in the [cuda] section of Stackoverflow the other day, where it seemed prudent to calculate initial approximations to the real roots of a cubic polynomial in single precision, then switch to double precision for the final Newton iterations on the approximate roots.

In case anybody was wondering about the version of the above based on __byte_perm(), here it is:

/* works correctly if argument is in +/-[2**-1, 2**3), or zero, infinity, NaN */
double my_fast_float2double (float a)
{
    unsigned int ia = __float_as_int (a);
    return __hiloint2double (__byte_perm (ia >> 3, ia, 0x7210), ia << 29);
}

Both of these implementations also work for negative operands, I have added a “+/-” to the comments to make that clearer. The basic idea here is to simply insert 3-bit strings of all 0s or all 1s as the exponent field is widened from 8 to 11 bits. I haven’t come up with anything faster yet, which doesn’t mean such solutions couldn’t exist.