When using the __hfma function, the device and host results differ by 1 ULP, when in theory they should match exactly (0 ULP)

Hello, I encountered some precision-related issues when using CUDA’s __hfma function. I hope you can help me.

I called the __hfma (half-precision fused multiply-add) function in a CUDA program to perform calculations, but found that there was a 1 ULP (unit in the last place) difference between the results calculated on the device side and the host side. The following is my test environment and code snippet:

Test environment:

GPU: [p4]
CUDA version: [12.1]
GPU architecture: [sm_60]
Compiler command: nvcc demo.cu -o demo -arch=sm_60

===================================================
#include <cuda_fp16.h>
#include <stdio.h>

// CUDA kernel to perform the hfma operation
__global__ void hfmaKernel(half *a, half *b, half *c, half *result) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx == 0) {
        // Perform hfma: result = a * b + c
        result[idx] = __hfma(a[idx], b[idx], c[idx]);
    }
}

int main() {

    size_t size = sizeof(half);

    // Host arrays
    half *h_a, *h_b, *h_c, *h_result;

    // Allocate host memory
    h_result = (half *)malloc(size);

    // Initialize host inputs from raw binary16 bit patterns
    unsigned short a = 0xF8A6;
    unsigned short b = 0x3B09;
    unsigned short c = 0xB944;

    h_a = reinterpret_cast<half *>(&a);
    h_b = reinterpret_cast<half *>(&b);
    h_c = reinterpret_cast<half *>(&c);

    // Device arrays
    half *d_a, *d_b, *d_c, *d_result;

    // Allocate device memory
    cudaMalloc(&d_a, size);
    cudaMalloc(&d_b, size);
    cudaMalloc(&d_c, size);
    cudaMalloc(&d_result, size);

    // Copy data from host to device
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_c, h_c, size, cudaMemcpyHostToDevice);

    // Launch kernel
    hfmaKernel<<<1, 1>>>(d_a, d_b, d_c, d_result);

    cudaDeviceSynchronize();

    // Copy result back to host
    cudaMemcpy(h_result, d_result, size, cudaMemcpyDeviceToHost);

    // Host-side reference for __hfma: multiply-add in float, then round to half
    float h_a_f = __half2float(*h_a);
    float h_b_f = __half2float(*h_b);
    float h_c_f = __half2float(*h_c);
    float res = h_a_f * h_b_f + h_c_f;
    half h_result_ref = __float2half(res);

    // Compare the results (raw bit pattern and decoded value)
    printf(" result in device:  %04X vs %f \n",
           *reinterpret_cast<unsigned short *>(h_result), __half2float(*h_result));
    printf(" result in host:    %04X vs %f \n",
           *reinterpret_cast<unsigned short *>(&h_result_ref), __half2float(h_result_ref));

    // Free device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    cudaFree(d_result);

    // Free host memory
    free(h_result);

    return 0;
}
==============================================
result is:

result in device: F817 vs -33504.000000
result in host: F816 vs -33472.000000

==============================================

What is the problem? Is the host's reference calculation method inappropriate, or is the difference caused by the device-side calculation? Please give me some help, thank you.

When posting code on these forums, please format it correctly. One possible approach: edit your post by clicking on the pencil icon below it, select the code in the edit window/pane, then click the </> button in the toolbar at the top of the pane, and save your changes. Please do that now. Thanks.

I have not examined your specific test vector, but generally speaking, this is not a correct way of emulating an FMA operation, due to double-rounding effects. Double rounding refers to a situation where a result is first rounded to an intermediate precision and subsequently to the target precision.

I am not aware of any simple way of emulating an FMA in bit-accurate fashion via computation in the next higher precision. This is one reason why many emulation-based implementations of fma() have bugs.

I have not completely convinced myself yet, but I think that emulating the half-precision FMA via double computation might work. One would obviously need to use the __double2half() conversion function in that case.

1 Like

The numerical values in the original code are:

a = -38080.000000000000000, b = 0.879394531250000, c = -0.658203125000000
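For reference, these decimal values follow directly from the binary16 encoding (1 sign bit, 5 exponent bits with bias 15, 10 fraction bits); decoding the first operand as a worked example:

$$ \mathrm{0xF8A6} \;=\; \underbrace{1}_{\text{sign}}\,\underbrace{11110}_{e=30}\,\underbrace{0010100110}_{f=166} \;\Rightarrow\; -\Bigl(1 + \tfrac{166}{1024}\Bigr) \cdot 2^{30-15} \;=\; -38080 $$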

Computing a*b+c using the Windows calculator app, I get:

-33488.001953125

If I focus on the value -33488, I observe that it is exactly halfway between the two computed representations:

That is, the distance from it to either of the computed representations (on the number line) is 16. Considering a round-to-nearest method, and the fact that the actual computed value is slightly below (more negative than) the midpoint, the correct result would be -33504.

0
...
-33472             host result
-33488             midpoint
-33488.001953125   correct result without rounding (i.e. "correctly rounded")
-33504             device result

What may be happening in the float case (the double-rounding referred to by njuffa):

-33488 is an exact representation for float (no surprise, since the float significand can hold integer values up to about 16 million). So if you do this calculation using float, I would expect the rounding to float to produce that result (because, generally speaking, float can represent about 5-7 decimal digits, and the deviation from that value does not begin until the 8th digit).

Then, when we round that float result to half, we have an ambiguous case for "round to nearest". The IEEE standard specifies that in this case an additional tie-breaking rule applies: "round to nearest even". This subsequent rounding rule prefers the value F816 over F817 in a tie.
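To make the double rounding concrete, here is a minimal host-only sketch (my own illustration, compiled with nvcc so the cuda_fp16.h host conversion routines are available; the half_bits helper is not part of the CUDA API) that reproduces both paths for this test vector:

#include <cuda_fp16.h>
#include <stdio.h>
#include <string.h>

// Return the raw binary16 bit pattern of a half (assumed helper, not a CUDA API)
static unsigned short half_bits(half h)
{
    unsigned short u;
    memcpy(&u, &h, sizeof(u));
    return u;
}

int main()
{
    // The test vector from the original post
    unsigned short ua = 0xF8A6, ub = 0x3B09, uc = 0xB944;
    half a, b, c;
    memcpy(&a, &ua, sizeof(a));
    memcpy(&b, &ub, sizeof(b));
    memcpy(&c, &uc, sizeof(c));

    // Path 1: intermediate result in float, then round to half (double rounding)
    float f = __half2float(a) * __half2float(b) + __half2float(c);
    half via_float = __float2half(f);

    // Path 2: intermediate result in double, then a single final rounding to half
    double d = (double)__half2float(a) * (double)__half2float(b) +
               (double)__half2float(c);
    half via_double = __double2half(d);

    printf("exact a*b+c (double): %.9f\n", d);   // -33488.001953125
    printf("float intermediate  : %.9f\n", f);   // -33488.0 (the tie rounds the excess away)
    printf("via float : 0x%04X (%f)\n", half_bits(via_float),  __half2float(via_float));
    printf("via double: 0x%04X (%f)\n", half_bits(via_double), __half2float(via_double));
    return 0;
}

With this test vector, the float path should print 0xF816 while the double path should print 0xF817, matching the device __hfma result.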

2 Likes

My test is still running, but using the following code as a host-side reference has not produced a mismatch with device-side code yet:

half hfma_host (half a, half b, half c)
{
    return __double2half ((double)__half2float (a) * (double)__half2float (b) +
                          (double)__half2float (c));
}
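For anyone who wants to reproduce this kind of comparison, below is a rough sketch of one way to drive it (my own scaffolding, not the original test; the kernel and helper names are invented for illustration). It samples random binary16 operand triples, runs __hfma on the device, and compares bit patterns against the hfma_host() reference above, treating any pair of NaN results as a match:

#include <cuda_fp16.h>
#include <stdio.h>
#include <string.h>
#include <random>

// Device side: apply __hfma to raw binary16 bit patterns
__global__ void hfma_device(const unsigned short *a, const unsigned short *b,
                            const unsigned short *c, unsigned short *r, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        r[i] = __half_as_ushort(__hfma(__ushort_as_half(a[i]),
                                       __ushort_as_half(b[i]),
                                       __ushort_as_half(c[i])));
    }
}

static half bits_to_half(unsigned short u) { half h; memcpy(&h, &u, sizeof(h)); return h; }
static unsigned short half_to_bits(half h) { unsigned short u; memcpy(&u, &h, sizeof(u)); return u; }

// Host-side reference from the post above: promote to double, one final rounding to half
static half hfma_host(half a, half b, half c)
{
    return __double2half((double)__half2float(a) * (double)__half2float(b) +
                         (double)__half2float(c));
}

// NaN test on raw binary16 bits (exponent all ones, nonzero significand)
static bool is_nan_bits(unsigned short u) { return (u & 0x7FFFu) > 0x7C00u; }

int main()
{
    const int N = 1 << 20;
    unsigned short *a, *b, *c, *r;
    cudaMallocManaged(&a, N * sizeof(*a));
    cudaMallocManaged(&b, N * sizeof(*b));
    cudaMallocManaged(&c, N * sizeof(*c));
    cudaMallocManaged(&r, N * sizeof(*r));

    std::mt19937 gen(12345);
    std::uniform_int_distribution<unsigned int> dist(0, 0xFFFF);
    for (int i = 0; i < N; i++) {
        a[i] = (unsigned short)dist(gen);
        b[i] = (unsigned short)dist(gen);
        c[i] = (unsigned short)dist(gen);
    }

    hfma_device<<<(N + 255) / 256, 256>>>(a, b, c, r, N);
    cudaDeviceSynchronize();

    int mismatches = 0;
    for (int i = 0; i < N; i++) {
        unsigned short ref = half_to_bits(hfma_host(bits_to_half(a[i]),
                                                    bits_to_half(b[i]),
                                                    bits_to_half(c[i])));
        if (is_nan_bits(ref) && is_nan_bits(r[i])) continue;  // both NaN: count as a match
        if (ref != r[i]) mismatches++;
    }
    printf("mismatches: %d / %d\n", mismatches, N);
    return 0;
}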
3 Likes

First of all, I would like to thank everyone for the patient analysis and guidance. This is truly remarkable and has allowed me to deeply appreciate the value and charm of the CUDA developer community.

As @Robert_Crovella analyzed, the float intermediate rounds the correct result to -33488.0 because float cannot accurately represent the remaining digits, and the subsequent conversion to half then becomes a tie that the "round to nearest even" rule resolves to -33472 (as represented in half).

@njuffa’s suggestion was absolutely correct. After verifying the issue, I have managed to fix it. Initially, I attempted to compute the result in double, convert it back to float, and then to half, but that approach did not work. It now seems that precision was lost in the extra double-to-float rounding step, which I had overlooked.

It is worth mentioning that, due to the multiple conversions on the host side, host calculations sometimes deviate from the device-side calculations done in half. This may occasionally lead to unexpected results (e.g., even double might fail to round correctly to half). A more robust long-term solution might be to allow some error tolerance within a boundary, such as permitting -33488.00 ± e (where e is a small margin of error). This effectively allows the ULP difference between the host-side float result and the device-side half result to fall around 0.501.

Recently, I have been exploring CUDA’s math APIs, so I might encounter further issues and bring them to the community in the future.
Once again, thank you all for your help! My utmost respect to everyone!

Sorry, I am not following. Could you give an example? For addition, subtraction, multiplication, division, and square root, rounding a result computed in the next higher IEEE format yields bit-wise identical results in the target precision, so if you emulate these on a per-operation basis on the host, the results should match those on the device (taking into account the use of canonical NaN on the device, if need be). There is a seminal paper that explains under which conditions double-rounding is harmless:

Figueroa, Samuel A. “When is double rounding innocuous?” ACM SIGNUM Newsletter 30.3 (1995): 21-26.

From memory, the gist of it is that if the target precision uses p bits, and the first rounding is to a format with at least 2*p+2 bits, then double rounding is harmless for the enumerated operations. This condition holds between consecutive IEEE-754 formats binary16, binary32, binary64, and binary128, since 2*11+2 <= 24, 2*24+2 <= 53, and 2*53+2 <= 113.

Follow-up publications:

Martin-Dorel, Érik, Guillaume Melquiond, and Jean-Michel Muller. “Some issues related to double rounding.” BIT Numerical Mathematics 53.4 (2013): 897-924.

Roux, Pierre. “Innocuous double rounding of basic arithmetic operations.” Journal of Formalized Reasoning 7.1 (2014): 131-142.

Higham, Nicholas J., and Srikara Pranesh. “Simulating low precision floating-point arithmetic.” SIAM Journal on Scientific Computing 41.5 (2019): C585-C602.

If you are experiencing issues with these operations, check your host compiler settings to make sure it is not violating IEEE-754 semantics. You want /fp:strict on Windows and -ffp-model=strict on Linux (unless you are using gcc, in which case I don’t know what one should use).
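For example (my own illustration, assuming the MSVC host compiler on Windows and a clang host compiler selected via -ccbin on Linux), the host-compiler flag can be forwarded through nvcc with -Xcompiler:

Windows (MSVC host compiler): nvcc -arch=sm_60 -Xcompiler "/fp:strict" demo.cu -o demo
Linux (clang host compiler):  nvcc -arch=sm_60 -ccbin clang++ -Xcompiler "-ffp-model=strict" demo.cu -o demo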

2 Likes

I have revisited the topic and run extensive experiments across many scenarios and data sets. The solutions discussed above have consistently proven effective, and the results have always been satisfactory.

Please disregard my earlier paragraph about allowing an error tolerance; it was just a casual remark and not of any importance.

Wow, thank you for providing the literature! It’s exactly what I need, as I lack sufficient knowledge in this area.

Finally, thank you once again for your response. Your answer was excellent!

1 Like