In various circumstances, one comes across a situation where some initial approximation is computed in single precision for performance reasons, then the result is up-converted to double precision for the final computation steps. Unfortunately, the float-to-double conversion instruction provided by the GPU is itself one of the slower 64-bit operations. One particular such context would be a double-precision cbrt() implementation.

I was therefore looking for something faster, in particular for “DP lite” platforms like sm_5x, which led to the code below. Maybe others will find this useful as well. It translates into five instructions on sm_3x (3 LOPs, 2 shifts) and four instructions on sm_5x (2 shifts, MOV32I, LOP3). A three-instruction sequence should be possible by changing the mask to 0x00ffffff and using __byte_perm(), but this would drastically narrow the supported range, down to [0.5, 8.0). Use of the bitfield-insertion instruction may be worth looking at as well, but my understanding is that BFE and BFI are fairly low-throughput instructions, so I stuck to simple operations.

```
/* works correctly if argument is in +/-[2**-15, 2**17), or zero, infinity, NaN */
__device__ __forceinline__
double my_fast_float2double (float a)
{
    unsigned int ia = __float_as_int (a);
    /* high word: keep the sign and top four exponent bits, shift the remaining
       bits right by three; within the stated range this reproduces the double's
       rebiased exponent (float bias 127 -> double bias 1023).
       low word: the bottom three mantissa bits move to the top of the low word */
    return __hiloint2double ((((ia >> 3) ^ ia) & 0x07ffffff) ^ ia, ia << 29);
}
```