Unexpected rounding behavior in HFMA


I bumped into some strange issues while testing my HGEMM implementation: the tests pass against cuBLAS but fail against half.sourceforge.net. After hours of debugging I was able to reduce it to a minimal example (apologies for the ugly-looking floats below; this is a very rare behavior that seems to occur in less than 0.01% of random computations).

Anyway, the following PTX code seems to exhibit unexpected rounding behavior:

.version 5.0
.target sm_60
.address_size 64
.entry debug(.param .u64 _dbg) {
  .reg .u64 %dbg;
  .reg .b16 %hacc, %ha, %hb;
  // No conversion actually occurs here: all these numbers are exactly representable in f16
  cvt.rn.f16.f32 %hacc, 0.40576171875;
  cvt.rn.f16.f32 %ha, 0.1300048828125;
  cvt.rn.f16.f32 %hb, 0.2469482421875;
  fma.rn.f16 %hacc, %ha, %hb, %hacc;
  ld.param.u64 %dbg, [_dbg];
  st.global.b16 [%dbg], %hacc;
  ret;
}
*dbg = 0.437744140625

This seems incorrect:

The result of the FMA (with fp32 multiplication and addition as per the PTX documentation) is:

round32(ha * hb + hc) = 0.4378662109375

which lies exactly halfway between the two consecutive half-floats:

0.437744140625 = 0.25 * 1.7509765625 – the fractional part is 1100000001 (odd)
0.43798828125 = 0.25 * 1.751953125 – the fractional part is 1100000010 (even)

It looks as though, in this particular case, the fma instruction gets confused and rounds to nearest-odd instead of nearest-even as specified by the PTX documentation.

As a side note, the normal cvt instruction seems to get it right:

.version 5.0
.target sm_60
.address_size 64
.entry debug(.param .u64 _dbg) {
  .reg .u64 %dbg;
  .reg .b16 %hacc;
  cvt.rn.f16.f32 %hacc, 0.4378662109375;
  ld.param.u64 %dbg, [_dbg];
  st.global.b16 [%dbg], %hacc;
  ret;
}
*dbg = 0.43798828125

This does look like a bug to me, but I may be missing something on the behavior of HFMA.


I did double-check the assertion that all three source operands are representable as IEEE-754 binary16 numbers; it holds. The infinitely precise result of the FMA is r = 0.43786619603633880615234375

r - 0.437744140625 = 1.2205541133880615234375E-4
r - 0.43798828125 = -1.2208521366119384765625E-4

This means that 0.437744140625 is closer to the infinitely precise result than 0.43798828125, so the correct FMA result is 0.437744140625. Which is what fma.rn.f16 returns.

Emulating FMA correctly is not a trivial task (been there, done that, got the t-shirt). You may want to double-check your emulation code.

Oh, I see. That’s very interesting. I was under the wrong impression that the HW implementation of FMA gave a result equivalent to:

fma(ha, hb, hc) = round16(round32(ha)*round32(hb) + round32(hc))

But apparently it’s more like:

fma(ha, hb, hc) = round16(round64(ha)*round64(hb) + round64(hc))

Is this correct, or is the reality even more complicated than that?

The important property of FMA is that the full product (not rounded, not truncated) participates in the addition.

As far as I know, it is not generally true that FMA_binary16 (ha, hb, hc) = (binary16)FMA_binary32 ((binary32)ha, (binary32)hb, (binary32)hc), due to an issue called double rounding. One can show that double rounding is innocuous for some operations, like square root, when using the next wider IEEE-754 format.

It is very unlikely that a hardware implementation of FMA_binary16 would be defective, because it can be tested exhaustively with 2**48 test cases (three 16-bit operands). A single high-end multi-core CPU would require no more than a week to crunch through that many test vectors.

I am not sure what you are denoting with the rounding operations on the inputs, which (as you noted) are already exactly representable as half-precision/16 bit floating point numbers.

The problem is that, in your notation, you are comparing the FMA result against a reference of

round16(round32(ha * hb + hc))

which produces a different result than directly rounding to

round16(ha * hb + hc)

EDIT: I hadn’t seen the previous post when I posted mine. Once again njuffa has given a better explanation faster than me.

Thanks a lot both of you. Everything’s clearer now!