h2sin performance

Hello,

I want to migrate to mixed precision a program in which I need to compute a huge number of sin functions (actually, what I need are sinc functions, but, as there is no native implementation, I compute it using sin functions). I expected my half2 implementation to be 2x faster, but it is even slower.

Dealing with arithmetic functions, such as __hadd2 or __hmul2, I get the desired x2 gain factor, however, using nvprof over the following test program:

#include <stdio.h>
#include <cuda_fp16.h>

#define Fs 16000.0
#define h2Fs __floats2half2_rn(Fs,Fs)

__global__
void generate_time_mp_kernel(int n, half2 *t)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n/2) t[i] = __floats2half2_rn(2.0*n/Fs, (2.0*n+1)/Fs);
}

__global__
void sin_mp_kernel(int n, half2 *x, half2 *t)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n/2) x[i] = h2sin(t[n]);
}

int main(void)
{
  int N = 1<<18; // 1<<20 = 1048576
  half2 *x, *t, *d_x, *d_t;
  t = (half2*)malloc(N/2*sizeof(half2));
  x = (half2*)malloc(N/2*sizeof(half2));

  cudaMalloc(&d_t, N/2*sizeof(half2));
  cudaMalloc(&d_x, N/2*sizeof(half2)); 

  generate_time_mp_kernel<<<(N/2+255)/256, 256>>>(N, d_t);
  sin_mp_kernel<<<(N/2+255)/256, 256>>>(N, d_x, d_t);

  cudaFree(d_x);
  cudaFree(d_t);
  free(x);
  free(t);
}

the time spent in sin_mp_kernel is quite similar to the time spent in sin_kernel for:

#include <stdio.h>

#define Fs 16000.0

__global__
void create_time_kernel(int n, float *t)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) t[i] = n / Fs;
}

__global__
void sin_kernel(int n, float *x, float *t)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) x[i] = sin(t[n]);
}

int main(void)
{
  int N = 1<<18; // 1<<20 = 1048576
  float *x, *t, *d_x, *d_t;
  t = (float*)malloc(N*sizeof(float));
  x = (float*)malloc(N*sizeof(float));

  cudaMalloc(&d_t, N*sizeof(float));
  cudaMalloc(&d_x, N*sizeof(float)); 

  generate_time_kernel<<<(N+255)/256, 256>>>(N, d_t);
  sin_kernel<<<(N+255)/256, 256>>>(N, d_x, d_t);

  cudaFree(d_x);
  cudaFree(d_t);
  free(x);
  free(t);
}

Is this the expected behaviour or am I doing something worng? Is h2sin(half2) as fast as sin(float) while computing 2 sins (as __hadd2 or __hmul2 are) or it is slower?

Thanks in advance,
David

By the way, I am using a Tesla P100 on a Google Cloud Instance with CUDA 10.

I wouldn’t expect the calculation of two sin quantities to be as fast as or faster than the calculation of one sin quantity.

I think its possible that the calculation of two 16-bit floating point sin quantities might be faster than the calculation of two 32-bit floating point sin quantities. This would be the general premise of mixed-precision performance, on processors that support full-rate FP16 calculation, like P100.

Also note that your codes are reading from locations (t[n]) that you haven’t initialized. (In fact, t[n] looks like out-of-bounds access to me. It’s usually good practice to run your code with cuda-memcheck, and also make sure you are compiling without -G switch, before doing any perf analysis. In fact I would normally suggest verifying for correct results before performing performance analysis, but there may be some argument there.) The sin function may have various execution paths, depending on the argument, which could lead to variability there.

If you run these codes with cuda-memcheck, you will discover all sorts of out-of-bounds accesses going on.

Thank you for your comments!

It was true that the test program had some errors (it was a quick test and I did not check it as much as it should). In case it could be useful for someone in the future, I leave you here the fixed programs and the profiling results.

Single precision

#include <stdio.h>

#define Fs 16000.0

__global__
void create_time_kernel(int n, float *t)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) t[i] = i / Fs;
}

__global__
void sin_kernel(int n, float *x, float *t)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) x[i] = sin(t[i]);
}

int main(void)
{
  int N = 1<<20; // 1<<20 = 1048576
  float *x, *t, *d_x, *d_t;
  t = (float*)malloc(N*sizeof(float));
  x = (float*)malloc(N*sizeof(float));

  cudaMalloc(&d_t, N*sizeof(float));
  cudaMalloc(&d_x, N*sizeof(float)); 

  create_time_kernel<<<(N+255)/256, 256>>>(N, d_t);
  sin_kernel<<<(N+255)/256, 256>>>(N, d_x, d_t);

  cudaFree(d_x);
  cudaFree(d_t);
  free(x);
  free(t);
}

Profiling results:

.           Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   52.05%  80.960us         1  80.960us  80.960us  80.960us  sin_kernel(int, float*, float*)
                   47.95%  74.592us         1  74.592us  74.592us  74.592us  create_time_kernel(int, float*)
      API calls:   99.42%  193.48ms         2  96.742ms  220.87us  193.26ms  cudaMalloc
                    0.28%  551.77us         2  275.89us  234.48us  317.29us  cudaFree
                    0.13%  247.15us         1  247.15us  247.15us  247.15us  cuDeviceTotalMem
                    0.12%  238.99us        96  2.4890us     127ns  88.185us  cuDeviceGetAttribute
                    0.03%  57.393us         2  28.696us  12.413us  44.980us  cudaLaunchKernel
                    0.01%  21.620us         1  21.620us  21.620us  21.620us  cuDeviceGetName
                    0.00%  3.2510us         3  1.0830us     180ns  2.3980us  cuDeviceGetCount
                    0.00%  2.9730us         1  2.9730us  2.9730us  2.9730us  cuDeviceGetPCIBusId
                    0.00%  1.1990us         2     599ns     242ns     957ns  cuDeviceGet
                    0.00%     257ns         1     257ns     257ns     257ns  cuDeviceGetUuid

Half precision

#include <stdio.h>
#include <cuda_fp16.h>

#define Fs 16000.0
#define h2Fs __floats2half2_rn(Fs,Fs)

__global__
void generate_time_mp_kernel(int n, half2 *t)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n/2) t[i] = __floats2half2_rn(2.0*i/Fs, (2.0*i+1)/Fs);
}

__global__
void sin_mp_kernel(int n, half2 *x, half2 *t)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n/2) x[i] = h2sin(t[i]);
}

int main(void)
{
  int N = 1<<20; // 1<<20 = 1048576
  half2 *x, *t, *d_x, *d_t;
  t = (half2*)malloc(N/2*sizeof(half2));
  x = (half2*)malloc(N/2*sizeof(half2));

  cudaMalloc(&d_t, N/2*sizeof(half2));
  cudaMalloc(&d_x, N/2*sizeof(half2)); 

  generate_time_mp_kernel<<<(N/2+255)/256, 256>>>(N, d_t);
  sin_mp_kernel<<<(N/2+255)/256, 256>>>(N, d_x, d_t);

  cudaFree(d_x);
  cudaFree(d_t);
  free(x);
  free(t);
}

Profiling results:

.           Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   52.89%  64.352us         1  64.352us  64.352us  64.352us  sin_mp_kernel(int, __half2*, __half2*)
                   47.11%  57.312us         1  57.312us  57.312us  57.312us  generate_time_mp_kernel(int, __half2*)
      API calls:   99.47%  187.56ms         2  93.779ms  132.52us  187.43ms  cudaMalloc
                    0.18%  346.98us         2  173.49us  118.03us  228.96us  cudaFree
                    0.16%  301.37us        96  3.1390us     126ns  96.400us  cuDeviceGetAttribute
                    0.14%  258.79us         1  258.79us  258.79us  258.79us  cuDeviceTotalMem
                    0.03%  51.175us         2  25.587us  11.229us  39.946us  cudaLaunchKernel
                    0.02%  37.970us         1  37.970us  37.970us  37.970us  cuDeviceGetName
                    0.00%  3.2700us         1  3.2700us  3.2700us  3.2700us  cuDeviceGetPCIBusId
                    0.00%  2.2650us         3     755ns     154ns  1.0660us  cuDeviceGetCount
                    0.00%  1.1030us         2     551ns     307ns     796ns  cuDeviceGet
                    0.00%     267ns         1     267ns     267ns     267ns  cuDeviceGetUuid

The h2sin(half2) seems to spent about the 80% of the time that you need to compute two sine functions with sin(float) in two kernels. It is worth to say that the precision of t is quite low at its end for the half2 case, but I do not think it should affect to the performance.

I would expect both sin_kernel() and sin_mp_kernel() to be almost completely memory bound (each estimated to use about 20 to 25 instructions in the critical path vs 8 bytes accessed), so it does not look like you are measuring what you desire to measure, namely compute throughput. I would suggest increasing the computational complexity by a factor of ten, for example by computing sin(sin(sin( … x)…))) prior to storing out the result.

In the following post I presented two possible alternatives for accurate scalar half-precision sin() implementations; you would have to do your own packing/unpacking for a half-precision vector version. How these compare to h2sin() performance-wise I do not know.

https://devtalk.nvidia.com/default/topic/982827/on-the-utility-of-sfu-instructions-for-half-precision-math-functions/

Generally speaking, for math functions implemented in software, the performance benefits of going to lower precision rapidly diminish when reducing precision below single precision. Also generally speaking, I am skeptical of the use of half-precision for computation, as opposed to use as a storage format, except in certain special cases. Rounding errors accumulating in even simple formulas can quickly lead to fewer than 8 “good” bits in the final result, which is usually the minimum required for many real-life scenarios.

Thank you very much njuffa!

You were right, I was taking advantage of the fact that with half2 I was reducing memory access by half. With a more intense computational task (as you proposed) h2sin() seems to be even slower than two calls to sin():

Single precision

#include <stdio.h>

#define Fs 16000.0

__global__
void generate_time_kernel(int n, float *t)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) t[i] = i / Fs;
}

__global__
void sin_intense_kernel(int n, float *x, float *t)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) {
    float aux = t[i];
    for (int j=0; j<10; j++) aux = sin(aux);
    x[i] = aux;
  }
}

int main(void)
{
  int N = 1<<20; // 1<<20 = 1048576
  float *x, *t, *d_x, *d_t;
  t = (float*)malloc(N*sizeof(float));
  x = (float*)malloc(N*sizeof(float));

  cudaMalloc(&d_t, N*sizeof(float));
  cudaMalloc(&d_x, N*sizeof(float)); 

  generate_time_kernel<<<(N+255)/256, 256>>>(N, d_t);
  sin_intense_kernel<<<(N+255)/256, 256>>>(N, d_x, d_t);

  cudaFree(d_x);
  cudaFree(d_t);
  free(x);
  free(t);
}

Profiling results:

.           Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   83.15%  368.87us         1  368.87us  368.87us  368.87us  sin_intense_kernel(int, float*, float*)
                   16.85%  74.752us         1  74.752us  74.752us  74.752us  generate_time_kernel(int, float*)
      API calls:   99.28%  188.35ms         2  94.176ms  153.25us  188.20ms  cudaMalloc
                    0.41%  784.24us         2  392.12us  240.55us  543.69us  cudaFree
                    0.16%  299.76us         1  299.76us  299.76us  299.76us  cuDeviceTotalMem
                    0.11%  204.44us        96  2.1290us     139ns  67.815us  cuDeviceGetAttribute
                    0.03%  51.735us         2  25.867us  10.949us  40.786us  cudaLaunchKernel
                    0.01%  16.939us         1  16.939us  16.939us  16.939us  cuDeviceGetName
                    0.00%  2.6020us         1  2.6020us  2.6020us  2.6020us  cuDeviceGetPCIBusId
                    0.00%  1.7430us         3     581ns     207ns     945ns  cuDeviceGetCount
                    0.00%     930ns         2     465ns     212ns     718ns  cuDeviceGet
                    0.00%     307ns         1     307ns     307ns     307ns  cuDeviceGetUuid

Half precision

#include <stdio.h>
#include <cuda_fp16.h>

#define Fs 16000.0
#define h2Fs __floats2half2_rn(Fs,Fs)

__global__
void generate_time_mp_kernel(int n, half2 *t)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n/2) t[i] = __floats2half2_rn(2.0*i/Fs, (2.0*i+1)/Fs);
}

__global__
void sin_intense_mp_kernel(int n, half2 *x, half2 *t)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n/2)  {
    half2 aux = t[i];
    for (int j=0; j<10; j++) aux = h2sin(aux);
	x[i] = aux;
  }
}

int main(void)
{
  int N = 1<<20; // 1<<20 = 1048576
  half2 *x, *t, *d_x, *d_t;
  t = (half2*)malloc(N/2*sizeof(half2));
  x = (half2*)malloc(N/2*sizeof(half2));

  cudaMalloc(&d_t, N/2*sizeof(half2));
  cudaMalloc(&d_x, N/2*sizeof(half2)); 

  generate_time_mp_kernel<<<(N/2+255)/256, 256>>>(N, d_t);
  sin_intense_mp_kernel<<<(N/2+255)/256, 256>>>(N, d_x, d_t);

  cudaFree(d_x);
  cudaFree(d_t);
  free(x);
  free(t);
}

Profiling results:

.           Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   87.44%  401.06us         1  401.06us  401.06us  401.06us  sin_intense_mp_kernel(int, __half2*, __half2*)
                   12.56%  57.600us         1  57.600us  57.600us  57.600us  generate_time_mp_kernel(int, __half2*)
      API calls:   99.41%  188.67ms         2  94.337ms  143.82us  188.53ms  cudaMalloc
                    0.37%  697.96us         2  348.98us  143.04us  554.92us  cudaFree
                    0.10%  185.15us         1  185.15us  185.15us  185.15us  cuDeviceTotalMem
                    0.09%  165.81us        96  1.7270us     119ns  50.786us  cuDeviceGetAttribute
                    0.03%  48.834us         2  24.417us  11.529us  37.305us  cudaLaunchKernel
                    0.01%  16.714us         1  16.714us  16.714us  16.714us  cuDeviceGetName
                    0.00%  2.8030us         1  2.8030us  2.8030us  2.8030us  cuDeviceGetPCIBusId
                    0.00%  2.6500us         3     883ns     169ns  1.8800us  cuDeviceGetCount
                    0.00%  1.0020us         2     501ns     228ns     774ns  cuDeviceGet
                    0.00%     247ns         1     247ns     247ns     247ns  cuDeviceGetUuid

I will try to use your sincos() implementation to make my own half2 sine function. As the argument reduction is based on fmaf() I should be able to perform two 16bits reductions in parallel using __hfma2(). By the way, why argument reduction is necessary before calling the SFU sine function? Reading its documentation (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#floating-point-instructions-sin) I thought it supported any float input.

I’ve just realized that your reduction code is exploiting the fact that you are working with 32bits when you only really care about 16bits, so I cannot just replace fmaf() by __hfma2().

If I want to use __hfma2() to compute two sine functions in parallel maybe it is better to work over the implementation you proposed here: https://devtalk.nvidia.com/default/topic/960757/cuda-programming-and-performance/a-faster-and-more-accurate-implementation-of-sincosf
isn’t it?

I don’t need to compute the sine function of big numbers; actually, all the inputs of the function will be in the interval [-200,200] and the bigger they are the less critical are their errors (I will use the sine function to implement a sinc function).

EDIT: I have just realized that 200 is actually quite big for float16… However, I think that, in my application, working with an int16 variable to store the integer closer to x/pi and a float16 to store x/pi - round(x/pi) would be possible. In this case, I would not need any argument reduction, isn’t it?

As I recall, I explained in the post I linked that the hardware argument reduction (the instruction RRO) used in conjunction with the hardware sine instruction (MUFU.SIN) introduces a phase error that makes the results of a half-precision function inaccurate (large ulp error) without correction, even with the very limited range of half-precision operands (|x| < 2**16).

Since your operands are even more range-restricted than that, that correction may not be necessary and you may be able to get away with calling raw_sin() directly. So your own h2sin() replacement would comprise: unpacking from half2 into two floats, two calls to raw_sin(), packing the float results into a half2.

From an ulp-error perspective just using raw_sin() leads to seemingly unacceptably large errors of up to ~= 120 ulps in [-200,200], but you stated that larger errors for larger arguments are acceptable in your use case.

As for your edit, it would be advantageous to know if your use case utilizes the normalized or the unnormalized sinc() function (see Wikipedia: https://en.wikipedia.org/wiki/Sinc_function). The normalized sinc() function can be constructed via sinpi() which lends itself to a straightforward argument reduction, but would require a custom core approximation for half precision.

Before we dive to that level, and in light of previous discussion about memory-limited vs compute-limited application-level performance: Do you have conclusive evidence from profiling that the computation of sinc() is actually a critical path in your code?

Thank you very much njuffa. It was the first time that I worked with half-precision arithmetics (and I do not have a long CUDA experience either) so at first I was a bit lost, but I think I am in the right track now.

My application is clearly compute-limited, however, it was true that the sinc function was not its only limitation. I have been optimizing the half-precision arithmetics of other parts of the kernel and I have finally achieved x1.6 speedup compared with the single precision arithmetics. It is not the x2 speedup that may be possible in theory, but I think it is a good enough result.

Actually, I found that the sine function is not even the only limitation for a sinc function when working with half2 variables. Something as simple as (x==0)? 1 : sin(x)/x becomes quite tricky when x is a half2 variable that stores two values and only one of them may be 0. After trying several implementations, I found that the most efficient way to do it was to unpack the two values of the half2 to two float variables, call two times to my sinc(float) function and then pack them again to half2.

I am not very happy about NVIDIA falling back to explicit SIMD to boost FP16 performance compared to FP32. This immediately raises the kind of issues about conditionals that you point out and that we thought we had left behind with CUDA’s SIMT model.

You should be able to replace the conditional with a call to __heq2(), which has a vector return that returns 1.0 for true and 0.0 for false. sin(x)/x is well defined except at zero, where it returns NaN, and sin(x)/x is 1.0 to within floating-point precision whenever |x| is tiny.

So by conditionally adding a tiny increment to x when x is zero, based on the output of __heq2(), you can then continue with the normal sinc() computation:

x = __hfma2 (__heq2 (x, zero), tiny, x); // x = (x==0) ? (x+tiny) : (x+0)
r = __h2div (__h2sin (x), x);            // sinc(x) = sin(x)/x

Above code written in browser and untested. “tiny” would be a half2-vector of something like 6e-8.

BTW, I wonder why its __h2div() instead of __hdiv2() according to the documentation.

I have finally implemented the sinc function as:

__device__ __forceinline__ half2 my_h2sin(half2 x) {
	return __floats2half2_rn(sinf(__low2float(x)), sinf(__high2float(x)));
}

__device__ __forceinline__ half2 my_h2sinc(half2 x) {
	x = __hfma2(__heq2(x, h2zeros), __float2half2_rn(1e-7f), x);
	return __h2div(my_h2sin(x), x);
}

I do not why, but my_h2sin() is faster than h2sin() (at least in my application). Interestingly, I use h2cos in another part of the code and it does not have this issue.

This is not the only mystery around __h2div()… It is in the documentation of the half arithmetic functions https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH____HALF__ARITHMETIC.html#group__CUDA__MATH____HALF__ARITHMETIC instead of the documentation of the half2 arithmetic functions https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH____HALF2__ARITHMETIC.html#group__CUDA__MATH____HALF2__ARITHMETIC, and I am not sure right now, but I think I have had problems with that function in some CUDA versions and I needed to remove the __ and just write h2div() instead.

Are you compiling your code with -use_fast_math, which basically turns sinf() into __sinf()?

I am not using -use_fast_math, but when I tried to replace sinf() with __sinf() the application became a bit slower.

That doesn’t make sense to me right now.

Neither to me. With the original float implementation:

__device__ __forceinline__ scalar_t sinc(scalar_t x) {
	return (x==0)? 1 : sinf(x)/x; 
}

The application runs faster with __sinf() as it would be expected.

May there be any drawback in calling __sinf() two times too closely in the same thread as I do in the half2 implementation?

I have made a smaller test:

#include <stdio.h>
#include <cuda_fp16.h>

#define Fs 16000.0
#define h2Fs __floats2half2_rn(Fs,Fs)


__device__ __forceinline__ half2 my_h2sin(half2 x) {
	return __floats2half2_rn(__sinf(__low2float(x)), __sinf(__high2float(x)));
}

__global__
void generate_time_mp_kernel(int n, half2 *t)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n/2) t[i] = __floats2half2_rn(2.0*i/Fs, (2.0*i+1)/Fs);
}

__global__
void sin_intense_mp_kernel(int n, half2 *x, half2 *t)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n/2)  {
    half2 aux = t[i];
    for (int j=0; j<10; j++) aux = my_h2sin(aux);
	x[i] = aux;
  }
}

int main(void)
{
  int N = 1<<20; // 1<<20 = 1048576
  half2 *x, *t, *d_x, *d_t;
  t = (half2*)malloc(N/2*sizeof(half2));
  x = (half2*)malloc(N/2*sizeof(half2));

  cudaMalloc(&d_t, N/2*sizeof(half2));
  cudaMalloc(&d_x, N/2*sizeof(half2)); 

  generate_time_mp_kernel<<<(N/2+255)/256, 256>>>(N, d_t);
  sin_intense_mp_kernel<<<(N/2+255)/256, 256>>>(N, d_x, d_t);

  cudaFree(d_x);
  cudaFree(d_t);
  free(x);
  free(t);
}

And here __sinf is clearly faster than sinf (21.6us vs 80.7us), so there must be something strange in my application. I do not think that the result of the sine function can affect the performance of the kernel (it is not used to control the exit of a loop or anything like that) but it is the only thing that comes to mind. I need to make a deeper review of the code of my application.

The CUDA profiler should be able to tell you how bottlenecks shift with the replacement of sinf() with __sinf(). It is possible for code to become bottlenecked on SFU throughput, but this is fairly rare in my experience.

The only scenario that I can imagine right now is that of constant propagation by the compiler, which may be possible for sinf() but not __sinf(), as that would require the compiler to model the RRO.SINCOS and MUFU.SIN instructions in bit-accurate fashion.

Caveats mentioned earlier with regard to the use of __sinf() outside a full revolution apply although (as you pointed out) this may not be a problem in terms of a sinc() computation on a limited range needed here.

I think that, as you said, all the issue was about compiler optimizations that were not possible when using __sinf().

Reviewing the code I found that, hidden in inline functions, there was a line that I was executing in a loop inside of a kernel with always the same variables values. After removing it from the inline function and putting it outside the kernel and passing the result as a parameter of the kernel everything seems more normal. Now, the faster way to compute the sinc is using h2sin, followed by:

__device__ __forceinline__ half2 my_h2sin(half2 x) {
	return __floats2half2_rn(__sinf(__low2float(x)), __sinf(__high2float(x)));
}

and finally, the slowest is using sinf() instead of __sinf().

Thank you very much njuffa!

I have finally made my own custom implementation of sinpi(half2):

__device__ __forceinline__ half2 my_h2sinpi(half2 x) {
	// Argument reduction to [-0.5, 0.5]
	half2 i = h2rint(x);
	half2 r = __hsub2(x, i);
	
	// sin(pi*x) polinomial approximation for x in [-0.5,0.5]
	half2 r2 = __hmul2(r, r);
	half2 s = __float2half2_rn(+2.31786431325108f);
	s = __hfma2(r2, s, __float2half2_rn(-5.14167814230801f));
	s = __hfma2(r2, s, __float2half2_rn(+3.14087446786993f));
	s = __hmul2(s, r);
	
	half2 i_2 = __hmul2(i, __float2half2_rn(0.5f));
	half2 sgn = __hfma2(__float2half2_rn(-2.0f), 
			    __hne2(__hsub2(i_2, h2rint(i_2)), h2zeros), 
			    h2ones); // 1 if i is even, else -1: -2 * ((i/2-round(i/2))!=0) + 1
	s = __hmul2(s, sgn);
	
	return s;
}

It is not extremely accurate but it is enough for my application. I know that the argument reduction is typically made to [-0.25,0.25], but I do not need to compute cosine function and working with half2 I could not find a nice way to handle the case when a half need to be approximated by sin_poly and the other by cos_poly.

I also tried to replace lines 13-17 by:

if (((int)__high2float(i)) & 1) s = __hmul2(s, __floats2half2_rn(+1,-1));
if (((int)__low2float(i))  & 1) s = __hmul2(s, __floats2half2_rn(-1,+1));

but it is slower… I could not find any faster way to correct the sign.

For sinpi(x) on [-0.5,0.5], my minimax approximation generator (based on the Remez algorithm) suggests the following coefficients:

+2.32663808
-5.14580529
+3.14125280

Thanks njuffa! I simply took the optimal numbers (in a least-squares sense) but without taking into account the 16-bit precision, so I’m sure your coefficients are better.