FP64 to FP16 Conversion: __double2half vs. __float2half(float(x))

Hi everyone,

I’m working on converting double-precision (FP64) values to half-precision (FP16) and I have a question about the best approach. Here are two methods I’m considering:

  1. Direct conversion: __double2half(x)
  2. Indirect conversion with intermediate float conversion: __float2half(float(x))

I’m unsure which method offers better performance and accuracy. Particularly, I’d like to understand:

  • Performance: Is one method generally faster than the other? Does it depend on the hardware?
  • Accuracy: Does either method introduce a potential loss of precision?

Specifically, regarding __double2half(x): does it perform an internal conversion to float before converting to FP16?

Additionally, how do these methods handle overflow and underflow situations (numbers outside the representable range of FP16)?

I’d appreciate any insights or experiences you might have with FP64 to FP16 conversion.

Thanks!

Should you wish to investigate yourself, a few avenues are available to you:

Performance is something you can measure. Whether or not any of this makes a difference for application performance is something only you can measure, in your application.

The remainder of the questions can probably be addressed by studying the SASS code (using CUDA binary utilities) for representative examples of each case, or simply testing (for questions about out-of-range behavior).

Keep in mind that many things not specified in the function documentation (most of your questions, I expect) are probably implementation details, which means they are subject to change in the next CUDA version. Here is a simple example to address your question about an internal conversion to float:

# cat t221.cu
#include <cuda_fp16.h>

__host__ __device__ __half c(const double a) {return __double2half(a);}


# nvcc -dc t221.cu
# cuobjdump -sass ./t221.o

Fatbin elf code:
================
arch = sm_52
code version = [1,7]
host = linux
compile_size = 64bit
compressed

        code for sm_52
                Function : _Z1cd
        .headerflags    @"EF_CUDA_TEXMODE_UNIFIED EF_CUDA_64BIT_ADDRESS EF_CUDA_SM52 EF_CUDA_VIRTUAL_SM(EF_CUDA_SM52)"
                                                                      /* 0x001c7c00fe8007f1 */
        /*0008*/                   MOV R6, R4 ;                       /* 0x5c98078000470006 */
        /*0010*/                   MOV R7, R5 ;                       /* 0x5c98078000570007 */
        /*0018*/                   F2F.F32.F64.RP R0, R6 ;            /* 0x5ca8010000670e00 */
                                                                      /* 0x001ffc01fda0071f */
        /*0028*/                   F2F.F32.F64.RM R3, R6 ;            /* 0x5ca8008000670e03 */
        /*0030*/                   FSETP.NE.AND P0, PT, R0, R3, PT ;  /* 0x5bb5038000370007 */
        /*0038*/                   MOV32I R3, 0x1 ;                   /* 0x010000000017f003 */
                                                                      /* 0x001fc801fec0071f */
        /*0048*/                   F2F.F32.F64.RZ R4, R6 ;            /* 0x5ca8018000670e04 */
        /*0050*/               @P0 BFI R5, R3, c[0x0][0x0], R4 ;      /* 0x4bf0020000000305 */
        /*0058*/                   SEL R4, R4, R5, !P0 ;              /* 0x5ca0040000570404 */
                                                                      /* 0x001ffc01ffe0071d */
        /*0068*/                   F2F.F16.F32 R4, R4 ;               /* 0x5ca8000000470904 */
        /*0070*/                   RET ;                              /* 0xe32000000007000f */
        /*0078*/                   BRA 0x78 ;                         /* 0xe2400fffff87000f */
                ..........



Fatbin ptx code:
================
arch = sm_52
code version = [8,2]
host = linux
compile_size = 64bit
compressed
ptxasOptions = --compile-only
# nvcc -dc t221.cu -arch=sm_89
# cuobjdump -sass ./t221.o

Fatbin elf code:
================
arch = sm_89
code version = [1,7]
host = linux
compile_size = 64bit
compressed

        code for sm_89
                Function : _Z1cd
        .headerflags    @"EF_CUDA_TEXMODE_UNIFIED EF_CUDA_64BIT_ADDRESS EF_CUDA_SM89 EF_CUDA_VIRTUAL_SM(EF_CUDA_SM89)"
        /*0000*/                   F2F.F16.F64 R4, R4 ;     /* 0x0000000400047310 */
                                                            /* 0x000e240000300800 */
        /*0010*/                   RET.ABS.NODEC R20 0x0 ;  /* 0x0000000014007950 */
                                                            /* 0x001fea0003e00000 */
        /*0020*/                   BRA 0x20;                /* 0xfffffff000007947 */
                                                            /* 0x000fc0000383ffff */
        /*0030*/                   NOP;                     /* 0x0000000000007918 */
                                                            /* 0x000fc00000000000 */
        /*0040*/                   NOP;                     /* 0x0000000000007918 */
                                                            /* 0x000fc00000000000 */
        /*0050*/                   NOP;                     /* 0x0000000000007918 */
                                                            /* 0x000fc00000000000 */
        /*0060*/                   NOP;                     /* 0x0000000000007918 */
                                                            /* 0x000fc00000000000 */
        /*0070*/                   NOP;                     /* 0x0000000000007918 */
                                                            /* 0x000fc00000000000 */
        /*0080*/                   NOP;                     /* 0x0000000000007918 */
                                                            /* 0x000fc00000000000 */
        /*0090*/                   NOP;                     /* 0x0000000000007918 */
                                                            /* 0x000fc00000000000 */
        /*00a0*/                   NOP;                     /* 0x0000000000007918 */
                                                            /* 0x000fc00000000000 */
        /*00b0*/                   NOP;                     /* 0x0000000000007918 */
                                                            /* 0x000fc00000000000 */
        /*00c0*/                   NOP;                     /* 0x0000000000007918 */
                                                            /* 0x000fc00000000000 */
        /*00d0*/                   NOP;                     /* 0x0000000000007918 */
                                                            /* 0x000fc00000000000 */
        /*00e0*/                   NOP;                     /* 0x0000000000007918 */
                                                            /* 0x000fc00000000000 */
        /*00f0*/                   NOP;                     /* 0x0000000000007918 */
                                                            /* 0x000fc00000000000 */
                ..........



Fatbin ptx code:
================
arch = sm_89
code version = [8,2]
host = linux
compile_size = 64bit
compressed
ptxasOptions = --compile-only
#

So we see that (for CUDA 12.2) in the cc5.2 case, a conversion is made from FP64 to FP32:

    /*0018*/                   F2F.F32.F64.RP R0, R6 ;            /* 0x5ca8010000670e00 */

then from FP32 to FP16:

    /*0068*/                   F2F.F16.F32 R4, R4 ;               /* 0x5ca8000000470904 */

It’s also evident that it is doing some extra work around rounding: it converts FP64 to FP32 with both round-up (RP) and round-down (RM) to detect whether that step is inexact, and if so sets the low bit of the round-to-zero (RZ) result as a sticky bit before the final FP32-to-FP16 conversion, presumably to avoid double rounding and meet the “round to nearest even” spec given in the documentation.

In the cc8.9 case, it appears to be a direct conversion:

    /*0000*/                   F2F.F16.F64 R4, R4 ;     /* 0x0000000400047310 */

Just as a reminder, any of the above information could change in a different CUDA version.

To get an idea of out-of-range behavior, you could construct a test like this:

# cat t221.cu
#include <cuda_fp16.h>
#include <cstdio>
__host__ __device__ __half c(const double a) {return __double2half(a);}


__global__ void k(const double a){

  __half h = c(a);
  printf("%f\n", __half2float(h));
}

int main(){
  k<<<1,1>>>(1e64);
  cudaDeviceSynchronize();
}
# nvcc -o t221 t221.cu
# compute-sanitizer ./t221
========= COMPUTE-SANITIZER
inf
========= ERROR SUMMARY: 0 errors
#

For this last test case, my guess would be that the observed behavior is dictated by C++ expectations for what to do with a result that is not representable in the destination type. I’m using “representable” somewhat loosely here; the behavior above is really a case of a value being “out of range”, and even that description needs to be handled carefully, according to C++ terminology rather than math terminology.