I am using the __half type from cuda_fp16.h, but I have not been able to determine what the equivalent of FLT_MAX is for this data type.
What is its value?
Is there a macro for this value?
From the cuda_fp16.h header file:
16 bits are being used in total: 1 sign bit, 5 bits for the exponent, and the significand is being stored in 10 bits. The total precision is 11 bits. There are 15361 representable numbers within the interval [0.0, 1.0], endpoints included. On average we have log10(2**11) ~ 3.311 decimal digits.
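The layout described above can be checked on the host with a small sketch. The helper names half_bits_to_float and count_values_in_unit_interval are hypothetical (they are not part of cuda_fp16.h); the code decodes raw 16-bit patterns by hand, assuming the 1/5/10 sign/exponent/significand split and an exponent bias of 15.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper (not from cuda_fp16.h): decode a raw binary16 bit
// pattern into a float using the 1/5/10 sign/exponent/significand layout.
float half_bits_to_float(std::uint16_t bits) {
    const int sign = (bits >> 15) & 0x1;
    const int exponent = (bits >> 10) & 0x1F;
    const int fraction = bits & 0x3FF;
    float value;
    if (exponent == 0) {
        // Zero or subnormal: no implicit leading 1, fixed exponent of -14,
        // so the value is fraction * 2^-10 * 2^-14.
        value = std::ldexp(static_cast<float>(fraction), -24);
    } else if (exponent == 0x1F) {
        // All exponent bits set: infinity or NaN.
        value = fraction ? NAN : INFINITY;
    } else {
        // Normal: the implicit leading 1 gives 11 bits of precision,
        // (1024 + fraction) * 2^-10 * 2^(exponent - 15).
        value = std::ldexp(static_cast<float>(fraction + 1024), exponent - 25);
    }
    return sign ? -value : value;
}

// Count the non-negative bit patterns whose value lies in [0.0, 1.0].
int count_values_in_unit_interval() {
    int count = 0;
    for (std::uint32_t b = 0; b < 0x8000; ++b) {
        const float v = half_bits_to_float(static_cast<std::uint16_t>(b));
        if (v >= 0.0f && v <= 1.0f) ++count;  // NaN fails both comparisons
    }
    return count;
}
```

With this decoding, half_bits_to_float(0x7BFF), the largest finite pattern, gives 65504, and count_values_in_unit_interval() gives 15361, matching the count quoted above.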
So this is essentially an IEEE 754-2008 compliant implementation of half-precision floating-point numbers. As far as I know, DirectX uses the same format (see d3dx9math.h and d3dx10math.h from the DirectX SDK). I am reproducing their definitions here; what you want seems to be D3DX_16F_MAX (65504) and D3DX_16F_MIN:
#define D3DX_16F_DIG 3 // # of decimal digits of precision
#define D3DX_16F_EPSILON 4.8875809e-4f // smallest such that 1.0 + epsilon != 1.0
#define D3DX_16F_MANT_DIG 11 // # of bits in mantissa
#define D3DX_16F_MAX 6.550400e+004 // max value
#define D3DX_16F_MAX_10_EXP 4 // max decimal exponent
#define D3DX_16F_MAX_EXP 15 // max binary exponent
#define D3DX_16F_MIN 6.1035156e-5f // min positive value
#define D3DX_16F_MIN_10_EXP (-4) // min decimal exponent
#define D3DX_16F_MIN_EXP (-14) // min binary exponent
#define D3DX_16F_RADIX 2 // exponent radix
#define D3DX_16F_ROUNDS 1 // addition rounding: near
#define D3DX_16F_SIGN_MASK 0x8000
#define D3DX_16F_EXP_MASK 0x7C00
#define D3DX_16F_FRAC_MASK 0x03FF
Bear in mind, however, that this format has other limitations as described here, not just the min/max limits.
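As a cross-check, the max and min values can be rederived from the exponent and significand widths rather than copied from a header. This is only a host-side sketch: the constant and function names are hypothetical and do not come from the CUDA or DirectX headers. It also shows that D3DX_16F_MIN is the smallest positive normal value; subnormal values go lower still.

```cpp
#include <cmath>

// Hypothetical constants (names are not from any SDK), derived from the
// binary16 layout: 5 exponent bits (bias 15) and 10 stored significand bits.
constexpr int kHalfFractionBits = 10;
constexpr int kHalfMaxExponent = 15;   // largest unbiased exponent
constexpr int kHalfMinExponent = -14;  // smallest normal unbiased exponent

// Largest finite value: significand (2 - 2^-10) times 2^15.
float half_max() {
    return std::ldexp(2.0f - std::ldexp(1.0f, -kHalfFractionBits),
                      kHalfMaxExponent);
}

// Smallest positive normal value: 1.0 times 2^-14.
float half_min_normal() {
    return std::ldexp(1.0f, kHalfMinExponent);
}

// Smallest positive subnormal value: 2^-10 times 2^-14 = 2^-24.
float half_min_subnormal() {
    return std::ldexp(1.0f, kHalfMinExponent - kHalfFractionBits);
}
```

Here half_max() evaluates to 65504 and half_min_normal() to 2^-14 (about 6.1035156e-5), in agreement with the D3DX definitions. In device code, passing 65504.0f to __float2half should produce the same maximum as a __half.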