Converting fp32 math to fp16 fails to give a speedup

We wanted to take advantage of the half-precision (fp16) throughput on the P100, so we converted a portion of a math-heavy kernel from single precision to half precision. After running the code on a DGX-1, we found that, instead of getting a speed bump, we got a speed drop :(

The core of my code is basically a ray tracer in a voxelated space. Here is the patch with the fp16 math:

https://github.com/fangq/mcx/commit/14bb584fd2d2672bb3718471a0fde94a31284bd6

In comparison, here is the code for the fp32 computation:

https://github.com/fangq/mcx/blob/14bb584fd2d2672bb3718471a0fde94a31284bd6/src/mcx_core.cu#L159-L195

On the DGX-1 (P100), the change dropped the speed by about 18%.

Any comments on what might be wrong with this implementation? Are there any best-practice guidelines for using fp16?


To reproduce this, you can run the following commands (you need CUDA 8 or 9):

git clone https://github.com/fangq/mcx.git
cd mcx/src
make half  # type "make" alone will create fp32 code
cd ../example/benchmark/
./run_benchmark1.sh

On a Tesla P100-SXM2-16GB, we got 40128.41 photon/ms with half precision; in comparison, the single-precision code (built with "make" instead of "make half") gives 48402.71 photon/ms.

At most, you will get a 2x speedup from the use of half2 FMA over float FMA. From that, subtract the cost of everything you are doing to convert float to half and then half back to float.
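To make the trade-off concrete, here is a minimal sketch (hypothetical helper names, not from the mcx code) contrasting a per-operation conversion pattern with native half2 arithmetic, using the standard intrinsics from cuda_fp16.h:

```cuda
#include <cuda_fp16.h>

// Anti-pattern: each call pays fp32->fp16 and fp16->fp32 conversions
// around a single scalar half FMA, so there is no SIMD gain left over.
__device__ float fma_with_conversions(float a, float b, float c)
{
    __half ha = __float2half(a);      // conversion overhead
    __half hb = __float2half(b);      // conversion overhead
    __half hc = __float2half(c);      // conversion overhead
    __half r  = __hfma(ha, hb, hc);   // one fp16 FMA
    return __half2float(r);           // conversion overhead
}

// Preferred: stay in half2, one instruction performs two fp16 FMAs
// and no conversion instructions are issued at all.
__device__ __half2 fma_native(__half2 a, __half2 b, __half2 c)
{
    return __hfma2(a, b, c);
}
```

The first function typically issues more conversion instructions than arithmetic instructions, which is consistent with the slowdown observed above.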

What txbob says, plus the overhead of having to emulate, in explicit SIMD, operations that do not have a "native" equivalent. That is the bane of explicit SIMD, and it has affected MMX, SSE, and AVX as well, causing multiple waves of new instructions to be added in the case of SSE and AVX to try to remedy it.

I consider explicit SIMD inherently (and fatally) flawed because of that, and the GPU's traditional implicit SIMD (a.k.a. SIMT) vastly superior. The only thing explicit SIMD has going for it is hardware simplicity, but that comes at a significant cost in reduced programmer productivity.

Thanks for both of your comments.

I suppose that if I can find native half2 equivalents for all the functions, then the fewer conversions, the better the performance should be. In other words: convert once, use many times, to amortize the overhead. Is this what you meant?

Minimizing the overall number of type-conversion operations is strongly advised. Moving data around as 'half2' instead of two separate 'float' values will also have a beneficial effect on the memory and register bandwidth used.

In some sense this is the analog of avoiding data movement between host memory and GPU memory when applying GPU acceleration to an application (instead, keep the data resident on the GPU).
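The "convert once, use many times" idea might look like the following sketch (hypothetical function and parameter names): conversions happen only at the boundaries, and the loop body stays entirely in half2.

```cuda
#include <cuda_fp16.h>

// Hypothetical device helper: convert the scale factors into half2 once,
// then amortize that cost over n half2 FMAs. The caller converts the
// result back to float only once, if it needs fp32 at all.
__device__ __half2 scaled_sum(const __half2 *v, int n, float2 scale)
{
    __half2 s   = __float2half2_rn(0.0f);            // one-time conversion
    __half2 hsc = __floats2half2_rn(scale.x, scale.y);
    for (int i = 0; i < n; ++i)
        s = __hfma2(v[i], hsc, s);                   // no conversions in the loop
    return s;
}
```

With n iterations, the two conversions at entry cost O(1) while the half2 FMAs do 2n fp16 operations, so the conversion overhead becomes negligible as n grows.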

Yes.

Even better: run your algorithm entirely on half2 data. Load half2, compute with half2, store half2. Even then you are unlikely to see the full theoretical 2x speedup, but you may come a lot closer. A further benefit of this approach is the lift from the storage and memory-bandwidth improvements that come with loading/storing twice as many elements per transaction.
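An end-to-end version of that advice, as a minimal kernel sketch (a half2 SAXPY, not the mcx kernel), could look like this:

```cuda
#include <cuda_fp16.h>

// Hypothetical example kernel: the arrays are stored as half2, so each
// thread loads, computes, and stores two fp16 elements at once.
// n2 is the number of half2 pairs, i.e. half the element count.
__global__ void saxpy_half2(int n2, __half2 a, const __half2 *x, __half2 *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2)
        y[i] = __hfma2(a, x[i], y[i]);   // two fp16 FMAs per instruction
}
```

Because the data never leaves fp16, there are no conversion instructions in the hot path, and each memory transaction moves twice as many elements as the equivalent float kernel would.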