Half2 performance

Hi,
Has anyone used half2 arithmetic successfully? In my tests, using the dedicated half2 intrinsics for anything more complex than A*X+B does not provide any speedup.

In the following, every half2 packs two half values:

HALF2
__device__ __forceinline__ half2 xxh(half2 d)
{
    const half2 A1 = __floats2half2_rn(0.31938153f, 0.31938153f);
    const half2 A2 = __floats2half2_rn(-0.356563782f, -0.356563782f);

    return __hmul2(d, h2exp(__hmul2(h2sqrt(__hmul2(h2sqrt(A2), A1)), d)));
    // time: 0.982425 msec
}

FLOAT
__device__ inline float xxf(float d)
{
    const float A1 = 0.31938153f;
    const float A2 = -0.356563782f;

    return d * expf(sqrtf(A1 * A1 * A2) * d);
    // time: 0.208930 msec
}

When I use a basic __hfma2 for the half2 function:

return __hfma2(d, A2, d);

the time for half2 is 0.131868 msec, in comparison to:

return d * A2 + d;

for float, where the time is 0.212713 msec.
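
For clarity, here is a minimal sketch of the kind of elementwise kernels behind those two timings (kernel names, constants, and the absence of bounds checks are illustrative, not my exact benchmark code):

#include <cuda_fp16.h>

// Illustrative comparison kernels, assuming one element per thread and a
// grid that exactly covers the data.
__global__ void fma_half2(half2 *data)
{
    const half2 A2 = __floats2half2_rn(-0.356563782f, -0.356563782f);
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    half2 d = data[idx];
    data[idx] = __hfma2(d, A2, d);   // d * A2 + d on two packed half values
}

__global__ void fma_float(float *data)
{
    const float A2 = -0.356563782f;
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    float d = data[idx];
    data[idx] = d * A2 + d;          // compiles to a single FFMA
}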

How can I overcome this problem?

Thanks,
Jack

The quoted performance numbers are meaningless without complete context. To what degree is the overall application-level performance dependent on memory throughput vs. computational throughput? Note that computational throughput for half-precision operations differs greatly by hardware platform, so at a minimum you would need to state which GPU you are using.

While transcendental functions can often be computed more efficiently at lower precision, that effect tends to rapidly diminish below single precision, especially if there is dedicated hardware acceleration for simple single-precision math functions as is the case on GPUs. The choices for half-precision math functions then often boil down to:

(1) Use the single-precision hardware. No speedup compared to single-precision computation. Small, straightforward code, very accurate half-precision results. (A minimal sketch of this approach follows after the list.)

(2) Use approximations specialized for half precision, but without dedicated hardware support. This could easily be slower than (1) due to the need to guard against intermediate overflow and underflow (the exponent range is extremely narrow) while trying to preserve accuracy.
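
To illustrate option (1), here is a minimal sketch (the function name h2exp_via_float is an illustrative placeholder, not an existing library function) of a half2 exponential that routes through the single-precision hardware: unpack to float2, evaluate with the fast single-precision intrinsic, and round back to half:

#include <cuda_fp16.h>

// Option (1) sketch: half2 exp computed via the single-precision path.
__device__ __forceinline__ half2 h2exp_via_float(half2 x)
{
    float2 xf = __half22float2(x);   // unpack the two half values to float
    xf.x = __expf(xf.x);             // fast single-precision exp (SFU-backed)
    xf.y = __expf(xf.y);
    return __float22half2_rn(xf);    // round the results back to half
}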

That probably explains your results when computing exp(sqrt(expr)). In my view, informed by relevant experience, the arguments for adding hardware support for half2 computation are weak outside of very narrowly defined circumstances:

(1) Explicit SIMD invariably interferes with compiler transformations, and even code generation by humans. The issues are often manageable for two SIMD lanes with some effort, but get progressively worse for four or eight SIMD lanes (as in SSE, AVX, AVX-512). Implicit SIMD, i.e. the GPU’s SIMT model, is a vastly superior approach, as code generation stays focused on scalar operations rather than vector operations.

(2) Use of half precision is an excellent tool for bandwidth reduction, and half precision is suitable for much real-world source data due to the limited resolution of the sensors measuring physical quantities (e.g. 10-bit resolution). It is a pain in the behind for computation due to the significant danger of overflow and underflow in intermediate computations, and because round-off errors can quickly eat up a significant portion of the available 11 mantissa bits. For many applications you would want at least 8 valid bits in the final output.

My baseline recommendation would be to use 16-bit half precision as a storage format in conjunction with scalar 32-bit float computation. Any deviations, in particular use of vectorized half-precision computation, should be carefully reasoned through and experimentally validated.

Hi,
Thanks for the comment.
It’s a GeForce GTX 1080 Ti with CUDA 9.2. How can I check whether this GPU has dedicated hardware acceleration for single-precision calculations?
I followed the assumptions presented in https://www.comp.nus.edu.sg/~wongwf/papers/hpec17.pdf and was wondering how to reduce the memory bandwidth of my computations in order to accelerate them. Other researchers report speedups in neural network training from using the half type. Could you recommend any further reading?

FP16 arithmetic on compute capability 6.1 devices is supported, but at a relatively low rate compared with FP32 (roughly 1/64 of the FP32 rate). This is covered in the relevant table in the programming guide:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions

https://devtalk.nvidia.com/default/topic/1023708/gpu-accelerated-libraries/fp16-support-on-gtx-1060-and-1080/

It probably isn’t interesting from a compute performance or throughput standpoint (on cc6.1 devices).

As njuffa said, it may still be interesting from a storage perspective, since it doubles the density of parameter storage (assuming parameters can be represented in FP16). This can have benefits both for storage size, as well as for read/write bandwidth utilization.

You can reduce memory bandwidth (as well as overall storage size) by loading/storing as half type:

#include <cuda_fp16.h>

__global__ void my_kernel(half2 *data){

  int idx = threadIdx.x+blockDim.x*blockIdx.x;   // assumes the grid exactly covers the data
  half2 my_half2 = data[idx];                    // one 32-bit load brings in two half values
  float2 my_float2 = __half22float2(my_half2);   // unpack to float for the arithmetic
  // perform calculations on float quantities here
  // results go back into my_float2 vector type
  my_half2 = __float22half2_rn(my_float2);       // round back to half for storage
  data[idx] = my_half2;                          // one 32-bit store
}
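
For completeness, a possible way to drive such a kernel from the host; the array size and launch configuration below are illustrative assumptions:

#include <cuda_fp16.h>

int main()
{
    const int n = 1 << 20;                   // number of half2 elements, i.e. 2*n half values
    half2 *d_data;
    cudaMalloc(&d_data, n * sizeof(half2));  // 4 bytes per half2: half the footprint of float2
    // ... fill d_data with source data already converted to half ...
    my_kernel<<<n / 256, 256>>>(d_data);     // n is a multiple of 256, so the grid covers it exactly
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}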

All NVIDIA GPUs shipped over the past dozen years have included what is called a special function unit (SFU), which provides single-precision approximation hardware based on quadratic interpolation in hardware tables. More recently it is called a multifunction unit (MUFU), as it also incorporates the texture interpolation hardware. It offers the following instructions:

MUFU.RCP 1/x
MUFU.RSQ 1/sqrt(x)
MUFU.EX2 exp2(x)
MUFU.LG2 log2(x)
MUFU.SIN sin(x)
MUFU.COS cos(x)

Since the Pascal architecture, compute capability 6.0, the following is also provided:

MUFU.SQRT sqrt(x)

To accelerate double-precision reciprocal, division, and square root, the following are supported starting (I think) with the Kepler architecture, i.e. compute capability 3.0:

MUFU.RSQ64H // reciprocal square root on upper 32 bits of DP operand
MUFU.RCP64H // reciprocal on upper 32 bits of DP operand
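
For reference, here is a rough sketch (my summary, not an official mapping) of the CUDA fast-math intrinsics that typically compile down to the single-precision MUFU operations listed above; the compiler may add a scaling multiply or a range-reduction step, and exact code generation varies with GPU architecture and compiler version:

__device__ void sfu_examples(float x, float out[6])
{
    out[0] = __fdividef(1.0f, x);  // fast division via MUFU.RCP
    out[1] = rsqrtf(x);            // approximate 1/sqrt(x), MUFU.RSQ
    out[2] = __expf(x);            // MUFU.EX2 after multiplying x by log2(e)
    out[3] = __logf(x);            // MUFU.LG2 followed by a multiply by ln(2)
    out[4] = __sinf(x);            // MUFU.SIN (with range reduction / scaling)
    out[5] = __cosf(x);            // MUFU.COS (with range reduction / scaling)
}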

I described a cursory exploration of how SFU instructions can be used to implement half-precision transcendental functions in this post from 2016:

https://devtalk.nvidia.com/default/topic/982827/on-the-utility-of-sfu-instructions-for-half-precision-math-functions/