FP64 to FP16 Conversion: __double2half vs. __float2half(float(x))

Hi everyone,

I’m working on converting double-precision (FP64) values to half-precision (FP16) and I have a question about the best approach. Here are two methods I’m considering:

  1. Direct conversion: __double2half(x)
  2. Indirect conversion with intermediate float conversion: __float2half(float(x))

I’m unsure which method offers better performance and accuracy. Particularly, I’d like to understand:

  • Performance: Is one method generally faster than the other? Does it depend on the hardware?
  • Accuracy: Does either method introduce a potential loss of precision?

Specifically, regarding __double2half(x): does it perform an internal conversion to float before converting to FP16?

Additionally, how do these methods handle overflow and underflow situations (numbers outside the representable range of FP16)?

I’d appreciate any insights or experiences you might have with FP64 to FP16 conversion.

Thanks!

Should you wish to investigate yourself, a few avenues are available to you:

Performance is something you can measure. Whether or not any of this makes a difference for application performance is something only you can measure, in your application.

The remainder of the questions can probably be addressed by studying the SASS code (using CUDA binary utilities) for representative examples of each case, or simply testing (for questions about out-of-range behavior).

Keep in mind that many things not specified in the function documentation (most of your questions, I expect) are probably implementation details, which means they are subject to change in the next CUDA version. Here is a simple example to address your question about an internal conversion to float:

# cat t221.cu
#include <cuda_fp16.h>

__host__ __device__ __half c(const double a) {return __double2half(a);}


# nvcc -dc t221.cu
# cuobjdump -sass ./t221.o

Fatbin elf code:
================
arch = sm_52
code version = [1,7]
host = linux
compile_size = 64bit
compressed

        code for sm_52
                Function : _Z1cd
        .headerflags    @"EF_CUDA_TEXMODE_UNIFIED EF_CUDA_64BIT_ADDRESS EF_CUDA_SM52 EF_CUDA_VIRTUAL_SM(EF_CUDA_SM52)"
                                                                      /* 0x001c7c00fe8007f1 */
        /*0008*/                   MOV R6, R4 ;                       /* 0x5c98078000470006 */
        /*0010*/                   MOV R7, R5 ;                       /* 0x5c98078000570007 */
        /*0018*/                   F2F.F32.F64.RP R0, R6 ;            /* 0x5ca8010000670e00 */
                                                                      /* 0x001ffc01fda0071f */
        /*0028*/                   F2F.F32.F64.RM R3, R6 ;            /* 0x5ca8008000670e03 */
        /*0030*/                   FSETP.NE.AND P0, PT, R0, R3, PT ;  /* 0x5bb5038000370007 */
        /*0038*/                   MOV32I R3, 0x1 ;                   /* 0x010000000017f003 */
                                                                      /* 0x001fc801fec0071f */
        /*0048*/                   F2F.F32.F64.RZ R4, R6 ;            /* 0x5ca8018000670e04 */
        /*0050*/               @P0 BFI R5, R3, c[0x0][0x0], R4 ;      /* 0x4bf0020000000305 */
        /*0058*/                   SEL R4, R4, R5, !P0 ;              /* 0x5ca0040000570404 */
                                                                      /* 0x001ffc01ffe0071d */
        /*0068*/                   F2F.F16.F32 R4, R4 ;               /* 0x5ca8000000470904 */
        /*0070*/                   RET ;                              /* 0xe32000000007000f */
        /*0078*/                   BRA 0x78 ;                         /* 0xe2400fffff87000f */
                ..........



Fatbin ptx code:
================
arch = sm_52
code version = [8,2]
host = linux
compile_size = 64bit
compressed
ptxasOptions = --compile-only
# nvcc -dc t221.cu -arch=sm_89
# cuobjdump -sass ./t221.o

Fatbin elf code:
================
arch = sm_89
code version = [1,7]
host = linux
compile_size = 64bit
compressed

        code for sm_89
                Function : _Z1cd
        .headerflags    @"EF_CUDA_TEXMODE_UNIFIED EF_CUDA_64BIT_ADDRESS EF_CUDA_SM89 EF_CUDA_VIRTUAL_SM(EF_CUDA_SM89)"
        /*0000*/                   F2F.F16.F64 R4, R4 ;     /* 0x0000000400047310 */
                                                            /* 0x000e240000300800 */
        /*0010*/                   RET.ABS.NODEC R20 0x0 ;  /* 0x0000000014007950 */
                                                            /* 0x001fea0003e00000 */
        /*0020*/                   BRA 0x20;                /* 0xfffffff000007947 */
                                                            /* 0x000fc0000383ffff */
        /*0030*/                   NOP;                     /* 0x0000000000007918 */
                                                            /* 0x000fc00000000000 */
        /*0040*/                   NOP;                     /* 0x0000000000007918 */
                                                            /* 0x000fc00000000000 */
        /*0050*/                   NOP;                     /* 0x0000000000007918 */
                                                            /* 0x000fc00000000000 */
        /*0060*/                   NOP;                     /* 0x0000000000007918 */
                                                            /* 0x000fc00000000000 */
        /*0070*/                   NOP;                     /* 0x0000000000007918 */
                                                            /* 0x000fc00000000000 */
        /*0080*/                   NOP;                     /* 0x0000000000007918 */
                                                            /* 0x000fc00000000000 */
        /*0090*/                   NOP;                     /* 0x0000000000007918 */
                                                            /* 0x000fc00000000000 */
        /*00a0*/                   NOP;                     /* 0x0000000000007918 */
                                                            /* 0x000fc00000000000 */
        /*00b0*/                   NOP;                     /* 0x0000000000007918 */
                                                            /* 0x000fc00000000000 */
        /*00c0*/                   NOP;                     /* 0x0000000000007918 */
                                                            /* 0x000fc00000000000 */
        /*00d0*/                   NOP;                     /* 0x0000000000007918 */
                                                            /* 0x000fc00000000000 */
        /*00e0*/                   NOP;                     /* 0x0000000000007918 */
                                                            /* 0x000fc00000000000 */
        /*00f0*/                   NOP;                     /* 0x0000000000007918 */
                                                            /* 0x000fc00000000000 */
                ..........



Fatbin ptx code:
================
arch = sm_89
code version = [8,2]
host = linux
compile_size = 64bit
compressed
ptxasOptions = --compile-only
#

So we see that (for CUDA 12.2) in the cc5.2 case, a conversion is made from FP64 to FP32:

    /*0018*/                   F2F.F32.F64.RP R0, R6 ;            /* 0x5ca8010000670e00 */

then from FP32 to FP16:

    /*0068*/                   F2F.F16.F32 R4, R4 ;               /* 0x5ca8000000470904 */

It’s also evident that it is doing some extra work around rounding: it converts FP64 to FP32 with both round-up (RP) and round-down (RM) to detect whether that step is inexact, and if so sets the low bit of the round-to-zero (RZ) result as a sticky bit before the final FP32-to-FP16 conversion, presumably to avoid double rounding and meet the “round to nearest even” spec given in the documentation.

In the cc8.9 case, it appears to be a direct conversion:

    /*0000*/                   F2F.F16.F64 R4, R4 ;     /* 0x0000000400047310 */

Just as a reminder, any of the above information could change in a different CUDA version.

To get an idea of out-of-range behavior, you could construct a test like this:

# cat t221.cu
#include <cuda_fp16.h>
#include <cstdio>
__host__ __device__ __half c(const double a) {return __double2half(a);}


__global__ void k(const double a){

  __half h = c(a);
  printf("%f\n", __half2float(h));
}

int main(){
  k<<<1,1>>>(1e64);
  cudaDeviceSynchronize();
}
# nvcc -o t221 t221.cu
# compute-sanitizer ./t221
========= COMPUTE-SANITIZER
inf
========= ERROR SUMMARY: 0 errors
#

For this last test case, my guess would be that the observed behavior is dictated by C++ expectations for what to do with a result that is not representable in the destination type. I’m using “representable” somewhat loosely here; the behavior above is really a case of a value being “out of range”, and even that description needs to be handled carefully, according to C++ terminology rather than math terminology.