Half2 performance

Has anyone used half2 arithmetic successfully? In my experience, using the dedicated half2 intrinsics for anything more complex than A*X+B provides no acceleration of the calculations.

In this case every half2 packs two half-precision values:

__device__ __forceinline__ half2 xxh(half2 d)
{
    const half2 A1 = __floats2half2_rn(0.31938153f, 0.31938153f);
    const half2 A2 = __floats2half2_rn(-0.356563782f, -0.356563782f);

    return __hmul2(d, h2exp(__hmul2(h2sqrt(__hmul2(h2sqrt(A2), A1)), d)));
    // time : 0.982425 msec
}


__device__ inline float xxf(float d)
{
    const float A1 = 0.31938153f;
    const float A2 = -0.356563782f;

    return d * expf(sqrtf(A1 * A1 * A2) * d);
    // time : 0.208930 msec
}

When I instead use a single __hfma2 in the half2 function:

return __hfma2(d, A2, d);

the time for half2 is 0.131868 msec, compared with

return d * A2 + d;

for float, which takes 0.212713 msec.

How can I overcome this problem?


The quoted performance numbers are meaningless without complete context. To what degree is the overall application-level performance dependent on memory throughput vs computational throughput? Note that computational throughput for half-precision operations differs significantly between hardware platforms, so at minimum you would need to state what GPU you are using.

While transcendental functions can often be computed more efficiently at lower precision, that effect tends to rapidly diminish below single precision, especially if there is dedicated hardware acceleration for simple single-precision math functions as is the case on GPUs. The choices for half-precision math functions then often boil down to:

(1) Use the single-precision hardware. No speedup compared to single-precision computation. Small, straightforward code, very accurate half-precision results.

(2) Use discrete approximations specialized for half precision, but without dedicated hardware support. This could easily be slower than (1) due to the need to guard against intermediate overflow and underflow (the exponent range is extremely narrow) while trying to preserve accuracy.

That probably explains your results when computing exp(sqrt(expr)). In my thinking, informed by relevant experience, arguments for adding hardware support for half2 computation are weak, outside of very narrowly defined circumstances:

(1) Explicit SIMD invariably interferes with compiler transformations, and even code generation by humans. The issues are often manageable for two SIMD lanes with some effort, but get progressively worse for four or eight SIMD lanes (as in SSE, AVX, AVX-512). Implicit SIMD, i.e. the GPU’s SIMT model, is a vastly superior approach, as code generation stays focused on scalar operations rather than vector operations.

(2) Use of half precision is an excellent tool for bandwidth reduction, and half-precision is suitable for much real-life source data due to the limited resolution of sensors measuring physical quantities (e.g. 10-bit resolution). It’s a pain in the behind for computation due to the significant danger of overflow and underflow in intermediate computations, and the fact that round-off errors can eat up a significant portion of the available 11 mantissa bits quickly. For many applications you would want at least 8 valid bits in the final output.

My baseline recommendation would be to use 16-bit half precision as a storage format in conjunction with scalar 32-bit float computation. Any deviations, in particular use of vectorized half-precision computation, should be carefully reasoned through and experimentally validated.

Thanks for the comment.
It’s a GeForce GTX 1080 Ti with CUDA 9.2. How can I check whether this GPU has dedicated hardware acceleration for single-precision calculations?
I followed the assumptions presented in https://www.comp.nus.edu.sg/~wongwf/papers/hpec17.pdf and was wondering how to reduce the memory bandwidth of my computations to accelerate them. Other researchers report speedups in neural-network training from using the half type. Could you recommend any further reading?

FP16 arithmetic is supported on compute capability 6.1 devices, but at a relatively low throughput compared with FP32. This is covered in the relevant table in the programming guide:



It probably isn’t interesting from a compute performance or throughput consideration (on cc6.1 devices).

As njuffa said, it may still be interesting from a storage perspective, since it doubles the density of parameter storage (assuming parameters can be represented in FP16). This can have benefits both for storage size, as well as for read/write bandwidth utilization.

You can reduce memory bandwidth (as well as overall storage size) by loading/storing as half type:

__global__ void kernel(half2 *data)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    half2 my_half2 = data[idx];
    float2 my_float2 = __half22float2(my_half2);
    // perform calculations on float quantities here
    // results go back into my_float2 vector type
    my_half2 = __float22half2_rn(my_float2);
    data[idx] = my_half2;
}

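For completeness, a minimal host-side driver for such a kernel might look as follows. This is only a sketch: the kernel name `kernel`, the element count `N`, and the block size are placeholders not taken from the original post, and error checking is omitted.

```cuda
#include <cuda_fp16.h>

int main()
{
    const int N = 1 << 20;              // number of half2 elements (assumed)
    half2 *d_data;
    cudaMalloc(&d_data, N * sizeof(half2));
    // ... fill d_data, e.g. via cudaMemcpy from a host buffer ...
    kernel<<<(N + 255) / 256, 256>>>(d_data);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

Note that loading a half2 moves half as many bytes per element as a float2 load would, which is where the bandwidth saving comes from.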
All NVIDIA GPUs shipped over the past dozen years have included what is called a special function unit (SFU), which provides single-precision approximation hardware based on quadratic interpolation in hardware tables. More recently it has been called the multifunction unit (MUFU), as it also incorporates the texture interpolation hardware. It offers the following instructions:

MUFU.RSQ 1/sqrt(x)
MUFU.EX2 exp2(x)
MUFU.LG2 log2(x)
MUFU.SIN sin(x)
MUFU.COS cos(x)

Since the Pascal architecture, compute capability 6.0, the following is also provided:

MUFU.SQRT sqrt(x)

To accelerate double-precision reciprocal, division, and square root, the following are supported from (I think) the Kepler architecture, i.e. compute capability 3.0, on:

MUFU.RSQ64H // reciprocal square root on upper 32 bits of DP operand
MUFU.RCP64H // reciprocal on upper 32 bits of DP operand
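These half-result instructions supply only a coarse seed; the full double-precision reciprocal is then obtained by Newton-Raphson refinement, each step of which roughly doubles the number of correct bits. A generic C sketch of that iteration (the seed value used below is arbitrary, for illustration only):

```c
#include <math.h>

/* Newton-Raphson refinement of a reciprocal approximation:
   x_{n+1} = x_n * (2 - a * x_n). Each step roughly doubles
   the number of correct bits in the estimate of 1/a. */
double refine_rcp(double a, double seed)
{
    double x = seed;
    x = x * (2.0 - a * x);  /* first refinement step */
    x = x * (2.0 - a * x);  /* second refinement step */
    return x;
}
```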

I described a cursory exploration of how SFU instructions can be used to implement half-precision transcendental functions in this post from 2016: