Converting fp32 math to fp16 fails to give a speedup

We wanted to take advantage of the half-precision (fp16) throughput on the P100, so we converted a portion of a math-heavy kernel from single precision to half precision. After running the code on a DGX-1, we found that, instead of getting a speed bump, we got a speed drop :(

The core of my code is basically a ray tracer in a voxelated space. Here is the patch with the fp16 math:

https://github.com/fangq/mcx/commit/14bb584fd2d2672bb3718471a0fde94a31284bd6

In comparison, here is the code for the fp32 computation:

https://github.com/fangq/mcx/blob/14bb584fd2d2672bb3718471a0fde94a31284bd6/src/mcx_core.cu#L159-L195

On the DGX-1 (P100), the change dropped the speed by about 18%.

Any comments on what might be wrong with this implementation? Are there any best-practice guidelines for using fp16?


To reproduce this, you can run the following commands (you need CUDA 8 or 9):

git clone https://github.com/fangq/mcx.git
cd mcx/src
make half  # type "make" alone will create fp32 code
cd ../example/benchmark/
./run_benchmark1.sh

On a Tesla P100-SXM2-16GB, we got 40128.41 photon/ms with half precision; in comparison, the single-precision code (built with "make" instead of "make half") gives 48402.71 photon/ms.

At most, you will get a 2x speedup from the use of half2 FMA over float FMA. From that, subtract the cost of everything you are doing to convert float to half and then half back to float.
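To make the trade-off concrete, here is a minimal sketch (hypothetical helper names, not from the mcx code) contrasting a per-operation conversion pattern with native half2 arithmetic, using the standard intrinsics from cuda_fp16.h:

```cuda
#include <cuda_fp16.h>

// Anti-pattern: each call pays fp32->fp16 and fp16->fp32 conversions
// around a single scalar half FMA, so there is no SIMD gain left over.
__device__ float fma_with_conversions(float a, float b, float c)
{
    __half ha = __float2half(a);      // conversion overhead
    __half hb = __float2half(b);      // conversion overhead
    __half hc = __float2half(c);      // conversion overhead
    __half r  = __hfma(ha, hb, hc);   // one fp16 FMA
    return __half2float(r);           // conversion overhead
}

// Preferred: stay in half2, one instruction performs two fp16 FMAs
// and no conversion instructions are issued at all.
__device__ __half2 fma_native(__half2 a, __half2 b, __half2 c)
{
    return __hfma2(a, b, c);
}
```

The first function typically issues more conversion instructions than arithmetic instructions, which is consistent with the slowdown observed above.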

What txbob says, plus the overhead of having to emulate, in explicit SIMD, operations that do not have a "native" equivalent. That is the bane of explicit SIMD, and it has affected MMX, SSE, and AVX as well, causing multiple waves of new instructions to be added in the case of SSE and AVX to try to remedy it.

I consider explicit SIMD inherently (and fatally) flawed because of that, and the GPU's traditional implicit SIMD (a.k.a. SIMT) vastly superior. The only thing explicit SIMD has going for it is hardware simplicity, but that comes at a significant cost in reduced programmer productivity.

Thanks for both of your comments.

I suppose that if I can find native half2 equivalents for all the functions, then the fewer conversions, the better the performance should be. In other words: convert once, use many times, to amortize the overhead. Is this what you meant?

Minimizing the overall number of type-conversion operations is strongly advised. Moving data around as 'half2' instead of two separate 'float' values will also have a beneficial effect on the memory and register bandwidth used.

In some sense this is the analog of avoiding data movement between host memory and GPU memory when applying GPU acceleration to an application (instead, keep the data resident on the GPU).
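The "convert once, use many times" idea might look like the following sketch (hypothetical function and parameter names): conversions happen only at the boundaries, and the loop body stays entirely in half2.

```cuda
#include <cuda_fp16.h>

// Hypothetical device helper: convert the scale factors into half2 once,
// then amortize that cost over n half2 FMAs. The caller converts the
// result back to float only once, if it needs fp32 at all.
__device__ __half2 scaled_sum(const __half2 *v, int n, float2 scale)
{
    __half2 s   = __float2half2_rn(0.0f);            // one-time conversion
    __half2 hsc = __floats2half2_rn(scale.x, scale.y);
    for (int i = 0; i < n; ++i)
        s = __hfma2(v[i], hsc, s);                   // no conversions in the loop
    return s;
}
```

With n iterations, the two conversions at entry cost O(1) while the half2 FMAs do 2n fp16 operations, so the conversion overhead becomes negligible as n grows.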

Yes.

Even better: run your algorithm entirely on half2 data. Load half2, compute with half2, store half2. Even then you are unlikely to see the full theoretical 2x speedup, but you may come a lot closer. A further benefit of this approach is the lift from the storage and memory-bandwidth improvements that come with loading/storing twice as many elements per transaction.
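An end-to-end version of that advice, as a minimal kernel sketch (a half2 SAXPY, not the mcx kernel), could look like this:

```cuda
#include <cuda_fp16.h>

// Hypothetical example kernel: the arrays are stored as half2, so each
// thread loads, computes, and stores two fp16 elements at once.
// n2 is the number of half2 pairs, i.e. half the element count.
__global__ void saxpy_half2(int n2, __half2 a, const __half2 *x, __half2 *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2)
        y[i] = __hfma2(a, x[i], y[i]);   // two fp16 FMAs per instruction
}
```

Because the data never leaves fp16, there are no conversion instructions in the hot path, and each memory transaction moves twice as many elements as the equivalent float kernel would.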