Precision is affected when adopting the __half (fp16) data type

lylyly6666 · July 6, 2023, 12:23pm

Following is the code:

__global__ void warmup(){
    
    __half tmp = 3;
    __half b = 1;
    __half wacc = 0;
    for(int i = 0; i < 75; i++) {
        for(int j = 0; j < 16; j++) {
            printf("index: %d, before: %f ", i, __half2float(wacc));
            wacc = __hadd(wacc, __hmul(tmp, b));
            printf("after: %f\n ", __half2float(wacc));
        }
    }
}

I expect the result to be 3600; however, the following is the output:

 index: 74, before: 4072.000000 after: 4076.000000
 index: 74, before: 4076.000000 after: 4080.000000
 index: 74, before: 4080.000000 after: 4084.000000
 index: 74, before: 4084.000000 after: 4088.000000
 index: 74, before: 4088.000000 after: 4092.000000
 index: 74, before: 4092.000000 after: 4096.000000
 index: 74, before: 4096.000000 after: 4100.000000
 index: 74, before: 4100.000000 after: 4104.000000
 index: 74, before: 4104.000000 after: 4108.000000
 index: 74, before: 4108.000000 after: 4112.000000
 index: 74, before: 4112.000000 after: 4116.000000

So I guess the problem may have to do with the function __hadd or __hmul, with the error accumulating in the for loop.

I also tried __hfma, using wacc = __hfma(tmp, b, wacc) in place of this line:

wacc = __hadd(wacc, __hmul(tmp, b));

but the output is the same as above.

So what can I do to get 3600 instead of the output above when using the __half data type? I really need someone’s help!

Robert_Crovella · July 6, 2023, 2:39pm

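An FP16 (half) number has 10 stored mantissa bits, or 11 significand bits counting the implicit leading bit. That is enough for about 3 decimal digits of resolution: every integer up to 2048 is exactly representable, but from 2048 to 4096 only multiples of 2 are, and from 4096 to 8192 only multiples of 4. When the numbers you are adding together differ by more than about 3 decimal orders of magnitude (as they do when you are adding 3 to a running sum in the thousands), each addition is rounded to the nearest representable value and you are not going to get the results you expect. In your output the sum grows by 4 per step instead of 3. The half format is not suitable (IMO) for such calculations.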
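As a minimal sketch of that rounding, the program below adds 3 to a few values near 2048 and 4096 in half precision. The kernel name `half_add_kernel` and the host helper `half_add_on_device` are hypothetical names chosen for illustration. It assumes a GPU with native half arithmetic (compute capability 5.3 or higher, e.g. built with -arch=sm_60) and that `__hadd` rounds to nearest even.

```cuda
#include <cstdio>
#include <cuda_fp16.h>

// Adds a and b in half precision on the device and stores the rounded sum.
__global__ void half_add_kernel(float a, float b, float *out) {
    __half sum = __hadd(__float2half(a), __float2half(b));
    *out = __half2float(sum);
}

// Hypothetical host helper: runs the kernel once and returns the result.
float half_add_on_device(float a, float b) {
    float *d_out = nullptr;
    float h_out = 0.0f;
    cudaMalloc(&d_out, sizeof(float));
    half_add_kernel<<<1, 1>>>(a, b, d_out);
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}

int main() {
    // Exact sum 2049 is halfway between 2048 and 2050.
    printf("2046 + 3 -> %.0f\n", half_add_on_device(2046.0f, 3.0f));
    // Exact sum 2051 is halfway between 2050 and 2052.
    printf("2048 + 3 -> %.0f\n", half_add_on_device(2048.0f, 3.0f));
    // Exact sum 4099 lies between 4096 and 4100.
    printf("4096 + 3 -> %.0f\n", half_add_on_device(4096.0f, 3.0f));
    return 0;
}
```

Under those assumptions this should print 2048, 2052 and 4100 respectively: once the running sum reaches 2048, adding 3 behaves like adding 4, which matches the steps of 4 in the output above.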

One approach would be to switch to FP32. You will run into a similar problem with FP32 numbers if/when you try to add two numbers that differ by more than about 6-7 decimal digits (decimal orders of magnitude).
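A variation on this, sketched below, keeps the inputs in __half but carries the running sum in a float and converts back to __half only once at the end. The names `accumulate_fp32` and `sum_with_fp32_accumulator` are hypothetical, and the sketch assumes the same build settings as the other examples in this thread (e.g. -arch=sm_60).

```cuda
#include <cstdio>
#include <cuda_fp16.h>

// Multiplies in half precision but accumulates the products in FP32.
__global__ void accumulate_fp32(int n, float *out) {
    __half tmp = __float2half(3.0f);
    __half b = __float2half(1.0f);
    float acc = 0.0f;  // FP32 running sum: 3 * 1200 fits without rounding
    for (int i = 0; i < n; i++) {
        acc += __half2float(__hmul(tmp, b));
    }
    // Round to half once at the end; 3600 is exactly representable in FP16.
    __half result = __float2half(acc);
    *out = __half2float(result);
}

// Hypothetical host helper: launches the kernel and returns the final sum.
float sum_with_fp32_accumulator(int n) {
    float *d_out = nullptr;
    float h_out = 0.0f;
    cudaMalloc(&d_out, sizeof(float));
    accumulate_fp32<<<1, 1>>>(n, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}

int main() {
    // Same trip count as the original 75 x 16 nested loops.
    printf("%f\n", sum_with_fp32_accumulator(75 * 16));
    return 0;
}
```

With 1200 additions of 3 this should print 3600.000000, because only the final value, not every intermediate sum, is rounded to half precision.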

To use FP16, it's even more important (than with FP32) to pay attention to the scaling of the numbers used in the calculations, relative to each other. One “possible method” to get your result to end up at 3600 is to make sure that the value you are adding each time is “within range” of 3600. Rather than adding 3 each time, we could meet this need, for this particular case, by adding 30 each time (and looping 120 times instead of 1200).

Here is an example:

$ cat t5.cu
#include <cstdio>

#include <cuda_fp16.h>
#include <math.h>

__global__ void warmup(){

    __half tmp = 3;
    __half b = 10;
    __half wacc = 0;
    for(int i = 0; i < 15; i++) {
        for(int j = 0; j < 8; j++) {
            if (i > 13) printf("index: %d, before: %f ", i, __half2float(wacc));
            wacc = __hadd(wacc, __hmul(tmp, b));
            if (i > 13) printf("after: %f\n ", __half2float(wacc));
        }
    }
}

int main(){

    warmup<<<1,1>>>();
    cudaDeviceSynchronize();
}
$ nvcc -o t5 t5.cu -arch=sm_60
$ ./t5
index: 14, before: 3360.000000 after: 3390.000000
 index: 14, before: 3390.000000 after: 3420.000000
 index: 14, before: 3420.000000 after: 3450.000000
 index: 14, before: 3450.000000 after: 3480.000000
 index: 14, before: 3480.000000 after: 3510.000000
 index: 14, before: 3510.000000 after: 3540.000000
 index: 14, before: 3540.000000 after: 3570.000000
 index: 14, before: 3570.000000 after: 3600.000000
$
