The quoted performance numbers are meaningless without complete context. To what degree does overall application-level performance depend on memory throughput versus computational throughput? Note that computational throughput for “Half” operations varies considerably by hardware platform, so at minimum you would need to state which GPU you are using.

While transcendental functions can often be computed more efficiently at lower precision, that effect tends to rapidly diminish below single precision, especially if there is dedicated hardware acceleration for simple single-precision math functions as is the case on GPUs. The choices for half-precision math functions then often boil down to:

(1) Use the single-precision hardware. No speedup compared to single-precision computation. Small, straightforward code, very accurate half-precision results.

(2) Use discrete approximations specialized for half precision, but without dedicated hardware support. Could easily be slower than (1) due to the need to guard against intermediate overflow and underflow (extremely narrow exponent range) while trying to preserve accuracy.

That probably explains your results when computing exp(sqrt(expr)). In my view, informed by relevant experience, the arguments for adding hardware support for half2 computation are weak outside of very narrowly defined circumstances:

(1) Explicit SIMD invariably interferes with compiler transformations, and even with code written by hand. The issues are often manageable for two SIMD lanes with some effort, but get progressively worse with wider vectors, such as the four, eight, or sixteen single-precision lanes of SSE, AVX, and AVX-512. Implicit SIMD, i.e. the GPU’s SIMT model, is a vastly superior approach, as code generation stays focused on scalar operations rather than vector operations.

(2) Use of half precision is an excellent tool for bandwidth reduction, and half precision is suitable for much real-life source data due to the limited resolution of sensors measuring physical quantities (e.g. 10-bit resolution). It’s a pain in the behind for computation, though, due to the significant danger of overflow and underflow in intermediate computations, and the fact that round-off errors can quickly eat up a significant portion of the available 11 mantissa bits (10 stored plus the implicit leading bit). For many applications you would want at least 8 valid bits in the final output.
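
To make the overflow, underflow, and round-off hazards concrete, here is a small host-side sketch. It is not GPU code: it uses Python's `struct` format `e` (IEEE 754 binary16) to round every intermediate to half, and Python's double-precision float stands in for "wider" arithmetic. The function names (`to_half`, `hypot_all_half`, `hypot_half_io`, `sum_ones_half`) are made up for illustration.

```python
import math
import struct

def to_half(x):
    """Round x to the nearest IEEE 754 binary16 value (returned as a Python float)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct raises for finite values beyond the half range (max 65504);
        # hardware would deliver infinity instead, so emulate that here.
        return math.copysign(math.inf, x)

def hypot_all_half(a, b):
    """sqrt(a*a + b*b) with every intermediate rounded to half."""
    a, b = to_half(a), to_half(b)
    aa = to_half(a * a)
    bb = to_half(b * b)
    return to_half(math.sqrt(to_half(aa + bb)))

def hypot_half_io(a, b):
    """Half inputs and half output, but intermediates kept in wider precision."""
    a, b = to_half(a), to_half(b)
    return to_half(math.sqrt(a * a + b * b))

def sum_ones_half(n):
    """Add 1.0 to a half-precision accumulator n times."""
    acc = 0.0
    for _ in range(n):
        acc = to_half(acc + 1.0)
    return acc

# Intermediate overflow: 300*300 = 90000 exceeds the half maximum of 65504.
print(hypot_all_half(300.0, 400.0), hypot_half_io(300.0, 400.0))  # inf 500.0

# Intermediate underflow: 1e-4 squared is below the smallest half subnormal.
print(hypot_all_half(1e-4, 0.0), hypot_half_io(1e-4, 0.0))

# Round-off: with 11 significand bits, 2048 + 1 rounds back to 2048.
print(sum_ones_half(4096))  # 2048.0
```

In all three cases the inputs and the correct result are comfortably representable in half precision; only the intermediates are the problem.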

My baseline recommendation would be to use 16-bit half precision as a storage format in conjunction with scalar 32-bit float computation. Any deviations, in particular use of vectorized half-precision computation, should be carefully reasoned through and experimentally validated.
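
As a rough illustration of that recommendation, the sketch below emulates the pattern on the host in Python: data lives in a 2-byte-per-element half buffer, each element is widened on load, exp(sqrt(x)) is evaluated in (emulated) 32-bit float, and the result is rounded to half once on store. The helper names (`pack_half`, `unpack_half`, `f32`, `exp_sqrt_half_storage`) are hypothetical, and the `struct`-based rounding is only a stand-in for what a GPU kernel would do with its own half/float conversions.

```python
import math
import struct

def pack_half(values):
    """Store values as IEEE 754 binary16, 2 bytes per element.
    Note: struct raises OverflowError for finite values beyond the half range."""
    return struct.pack(f"<{len(values)}e", *values)

def unpack_half(buf):
    """Load a half buffer; widening half to float is exact."""
    return list(struct.unpack(f"<{len(buf) // 2}e", buf))

def f32(x):
    """Round a Python float (double) to float32 to mimic 32-bit arithmetic."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def exp_sqrt_half_storage(buf):
    """Half in, half out; all arithmetic in emulated single precision."""
    out = []
    for h in unpack_half(buf):
        root = f32(math.sqrt(h))          # scalar float computation
        out.append(f32(math.exp(root)))   # still float, no half intermediates
    return pack_half(out)                 # one rounding to half, on store

src = pack_half([0.0, 1.0, 4.0, 9.0, 100.0])
dst = exp_sqrt_half_storage(src)
print(len(src), "bytes for 5 elements")   # 10 bytes for 5 elements
print(unpack_half(dst))
```

The memory traffic is that of half precision, while the accuracy of each result is limited only by the single final rounding to half.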