cramming 3 unsigned 20-21 bit floating point values into 64 bits?

I really would love to get away from the float3 type and use 64 bits to represent 3 unsigned float values.

In the past I have successfully bit-packed 32-bit floats into a 16-bit unsigned int representation and back to 32-bit floats, but this is slightly more complicated, and I was wondering if any work has been done in this area.

The motivation is that my applications are running out of memory resources (both shared memory and registers), and being able to represent 3 floating point values (x,y,z) in 64 bits may improve my occupancy.

Fortunately I know ahead of time that (x,y,z) will all be positive and will probably fall in the range [0, 1023] in floating point representation.
I will not use those 64-bit fields for accumulation, so I think overflow would not be an issue, and I would like to squeeze as much precision as I can out of each 20-21 bit section. I am guessing I would need to use an unsigned long long rather than deal with something like a uint2.

I imagine there will be lots of reinterpret_cast()s, but would like to see if that overhead is offset by the possible occupancy benefits.

I did Google this but did not find anything. I have a general idea of how to proceed, but if anyone has a link/reference to similar work it would be appreciated. I am sure I am not the only one who has considered this tactic.

Thanks…

Never mind, I was able to figure it out, and it seems to be valid for my use, at least to three decimal places.

I ended up using an unsigned long long type and broke it down into three groups of 21 bits.
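
For anyone curious, the approach was along these lines (a sketch rather than my exact code; it scales each value in [0, 1023] by 2048 = 2**21 / 1024 so the range fills 21 bits):

__device__ unsigned long long pack3 (float x, float y, float z)
{
    // scale [0, 1023] into [0, 2**21) and round to the nearest integer
    unsigned long long ix = __float2uint_rn (2048.0f * x);
    unsigned long long iy = __float2uint_rn (2048.0f * y);
    unsigned long long iz = __float2uint_rn (2048.0f * z);
    return ix | (iy << 21) | (iz << 42);
}

__device__ void unpack3 (unsigned long long p, float *x, float *y, float *z)
{
    *x = 4.8828125e-4f * (unsigned int)(p & 0x1fffff);          // 1/2048
    *y = 4.8828125e-4f * (unsigned int)((p >> 21) & 0x1fffff);
    *z = 4.8828125e-4f * (unsigned int)(p >> 42);
}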

I actually had one more question:

Will there be any significant difference (accuracy or performance) between explicit C-style casting and the CUDA math library intrinsic casting device functions like __float2ull_rn()?

In the past I used the CUDA casting intrinsics, but always wanted to ask.

I would think that using the “round to even” version might give better consistency?

C/C++ style casting and the CUDA casting intrinsics map to the same set of hardware instructions, such as I2F, F2I, F2F. The difference is that the C/C++ casts each prescribe a particular rounding mode, whereas the device intrinsics make all relevant IEEE-754 rounding modes available. This can save instructions, e.g. instead of i = (int)rintf(f), which is two instructions (F2F followed by F2I), one can use __float2int_rn(), which maps to a single F2I instruction. In many cases round-to-nearest-even will provide a bit more accuracy, and more importantly it may prevent biased drift in the data.
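
To illustrate the instruction savings (a minimal kernel; the surrounding code is just scaffolding for the two conversion idioms):

__global__ void cast_demo (const float *in, int *out_a, int *out_b)
{
    float f = in[threadIdx.x];
    out_a[threadIdx.x] = (int)rintf (f);      // F2F (rint) followed by F2I
    out_b[threadIdx.x] = __float2int_rn (f);  // single F2I with round-to-nearest-even
}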

From a performance perspective, it is important to know that all conversions from and to 64-bit types must go through the DP (double-precision) datapath, so they will be slow on consumer cards. Note that this applies to the intrinsic __float2ull_rn(). The reason for mapping these conversions to the DP path is that it is the only unit with a 64-bit source and result bus.

In the simplest case you may be able to simply grab the top-most 21 bits of each float variable and pack three of those into a 64-bit unsigned long long. That retains 12 bits of mantissa and would be roughly equivalent to converting the operands to FP16. However, FP16 allows denser packing, in that 64 bits comprise four FP16 operands. So it is not clear to me why you want to adopt custom 3:1 packing.
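
A sketch of what such truncation-based packing could look like (the helper names are mine; note that dropping the low mantissa bits rounds toward zero):

__device__ unsigned long long pack3_trunc (float a, float b, float c)
{
    // keep the top 21 bits of each float: sign, 8 exponent bits, 12 mantissa bits
    unsigned long long ia = (unsigned int)__float_as_int (a) >> 11;
    unsigned long long ib = (unsigned int)__float_as_int (b) >> 11;
    unsigned long long ic = (unsigned int)__float_as_int (c) >> 11;
    return ia | (ib << 21) | (ic << 42);
}

__device__ void unpack3_trunc (unsigned long long p, float *a, float *b, float *c)
{
    *a = __int_as_float ((int)((unsigned int)(p & 0x1fffff) << 11));
    *b = __int_as_float ((int)((unsigned int)((p >> 21) & 0x1fffff) << 11));
    *c = __int_as_float ((int)((unsigned int)(p >> 42) << 11));
}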

Given that you know the operands are in [0, 1023], I would suggest considering a 10.6 fixed-point format for each piece of data, which allows you to store four operands in a 64-bit unsigned long long. Whether this retains more accuracy than compressing into four FP16 operands will depend on your use case. In case those operands are actually integers in [0, 1023], you could of course try to store three of them in a 32-bit unsigned int.

ushort4 us;
float a, b, c, d;

us.x = __float2half_rn(a);  // pack/unpack with FP16
us.y = __float2half_rn(b);
us.z = __float2half_rn(c);
us.w = __float2half_rn(d);
a = __half2float(us.x);
b = __half2float(us.y);
c = __half2float(us.z);
d = __half2float(us.w);

us.x = __float2uint_rn (64.0f * a);  // pack/unpack with 10.6 fixed point
us.y = __float2uint_rn (64.0f * b);
us.z = __float2uint_rn (64.0f * c);
us.w = __float2uint_rn (64.0f * d);
a = 1.5625e-2f * us.x;
b = 1.5625e-2f * us.y;
c = 1.5625e-2f * us.z;
d = 1.5625e-2f * us.w;

My motivation was to bring down my shared memory usage by packing 3 float values into 63 bits (three 21-bit fields in a 64-bit word), keeping as much precision as that would allow.

That implementation did increase occupancy and the packing/unpacking method worked correctly, but overall the running time was about the same. I suppose all that 64-bit integer work offset the occupancy gains.

Thanks, I may try 16 bit next.

If FP16 accuracy is insufficient, you could try a custom floating-point format that fits into a ushort. Obviously you don’t need the sign bit. As for the exponent bits, it is not clear how much dynamic range is needed. If you can guarantee that only 2**-5 <= |x| < 2**10 plus zero is required, you could use a 4-bit exponent field and a 12-bit mantissa field, giving accuracy of about 4 decimal places.

Obviously the processing (packing/unpacking) would be somewhat more involved. If need be, you could extend this scheme to a 4-bit exponent plus a 17-bit mantissa, packing 3:1 into a 64-bit unsigned long long. This would provide almost 6 decimal places but would be even more expensive.

If the three values have correlated magnitudes, you could even try designing your own packing which has one exponent field shared between three mantissa fields. This might be applicable for example if you’re encoding an HDR light intensity where the component magnitudes are strongly correlated. If the values are unrelated, then the shared exponent probably isn’t useful.

You might allocate, say, 18 bits for each of three mantissas and 10 bits for the exponent. (No sign bit, since you say your values are positive.) You would NOT have an implicit leading 1 in the mantissa in this case.
If you know the maximum magnitude of the values is 1023, you could bias the exponent to max out at that value. A 7-bit exponent may even work OK, giving an extra mantissa bit.
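
A sketch of one way that layout could work (the helper names, the truncation-based rounding, and the exponent bias of 512 are all illustrative choices):

__device__ unsigned long long pack3_shared_exp (float x, float y, float z)
{
    // take the exponent of the largest component; frexpf returns f in [0.5, 1) with max = f * 2**e
    int e;
    frexpf (fmaxf (x, fmaxf (y, z)), &e);
    // scale so the largest component lands in [2**17, 2**18); the cast truncates toward zero
    float scale = exp2f ((float)(18 - e));
    unsigned long long mx = (unsigned long long)(x * scale);
    unsigned long long my = (unsigned long long)(y * scale);
    unsigned long long mz = (unsigned long long)(z * scale);
    // three 18-bit mantissas (no implicit leading 1) plus the biased exponent in the top 10 bits
    return mx | (my << 18) | (mz << 36) | ((unsigned long long)(e + 512) << 54);
}

__device__ void unpack3_shared_exp (unsigned long long p, float *x, float *y, float *z)
{
    int e = (int)(p >> 54) - 512;
    float scale = exp2f ((float)(e - 18));
    *x = scale * (unsigned int)(p & 0x3ffff);
    *y = scale * (unsigned int)((p >> 18) & 0x3ffff);
    *z = scale * (unsigned int)((p >> 36) & 0x3ffff);
}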

@SPWorley: Good idea about block floating point, can’t even recall the last time I saw that being used.

Here is code for compressing float data into a custom 16-bit floating-point format that can store data in [0.03125, 1023.93750] plus zero in a ushort. The rounding in the conversion from ‘float’ is round-to-nearest, ties away from zero, for performance reasons. The maximum error in a round-trip conversion is 2**-13 < 1.23e-4.

__forceinline__ __device__ unsigned short float2custom (float a)
{
    unsigned short r;
    int ia = __float_as_int (a);
    ia = ia + (1 << 10); // round
    r = (ia - (127 << 23) + (6 << 23)) >> 11;
    if (a < 0.03125f) r = 0; // FTZ: flush to zero
    return r;
}

__forceinline__ __device__ float custom2float (unsigned short a)
{
    float r;
    int ia = (a - (6 << 12) + (127 << 12)) << 11;
    r = __int_as_float (ia);
    if ((a & 0xf000) == 0) r = 0.0f; // DAZ: denormals are zero
    return r;
}

float a, b, c, d;
ushort4 us;

us.x = float2custom (a);
us.y = float2custom (b);
us.z = float2custom (c);
us.w = float2custom (d);
a = custom2float (us.x);
b = custom2float (us.y);
c = custom2float (us.z);
d = custom2float (us.w);

Here is the equivalent code for the 21-bit format for data in [0.03125, 1023.99799], plus zero. Maximum round-trip error is 2**-18 < 3.82e-6.

__forceinline__ __device__ unsigned int float2customl (float a)
{
    unsigned int r;
    int ia = __float_as_int (a);
    ia = ia + (1 << 5);  // round
    r = (ia - (127 << 23) + (6 << 23)) >> 6;
    if (a < 0.03125f) r = 0; // FTZ: flush to zero
    return r;
}

__forceinline__ __device__ float customl2float (unsigned int a)
{
    float r;
    int ia = (a - (6 << 17) + (127 << 17)) << 6;
    r = __int_as_float (ia);
    if ((a & 0x1e0000) == 0) r = 0.0f; // DAZ: denormals are zero
    return r;
}

float a, b, c, d;
unsigned long long ul;
  
ul = (((unsigned long long)float2customl(a) <<  0) | 
      ((unsigned long long)float2customl(b) << 21) |
      ((unsigned long long)float2customl(c) << 42));
a = customl2float ((unsigned int)(ul >>  0) & 0x1fffff);
b = customl2float ((unsigned int)(ul >> 21) & 0x1fffff);
c = customl2float ((unsigned int)(ul >> 42));