In my OpenCL code, the expression 1.0h / value works for half-precision float values.
But when I use OpenCL's built-in function for the same thing, a = half_recip(b), compilation fails, even though a and b are both half-precision floats.
The source:
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
__kernel
void foo
(
    __global half* dst,        // output buffer (written by the kernel)
    __global const half* src   // input buffer (read-only)
)
{
    const uint index = get_global_id(0);
    dst[index] = half_recip(src[index]);
}
The compiler output:
<kernel>:11:15: error: call to 'native_recip' is ambiguous
dst[index] = half_recip(src[index]);
^~~~~~~~~~~~~~~~~~~~~~
cl_kernel.h:1182:23: note: expanded from macro 'half_recip'
#define half_recip(x) native_recip(x)
^~~~~~~~~~~~
cl_kernel.h:1164:24: note: candidate function
float __OVERLOADABLE__ native_recip(float);
^
cl_kernel.h:1173:25: note: candidate function
double __OVERLOADABLE__ native_recip(double);
^
cl_kernel.h:1165:25: note: candidate function
float2 __OVERLOADABLE__ native_recip(float2);
^
cl_kernel.h:1167:25: note: candidate function
float3 __OVERLOADABLE__ native_recip(float3);
^
cl_kernel.h:1169:25: note: candidate function
float4 __OVERLOADABLE__ native_recip(float4);
^
cl_kernel.h:1170:25: note: candidate function
float8 __OVERLOADABLE__ native_recip(float8);
^
cl_kernel.h:1171:26: note: candidate function
float16 __OVERLOADABLE__ native_recip(float16);
^
cl_kernel.h:1174:26: note: candidate function
double2 __OVERLOADABLE__ native_recip(double2);
^
cl_kernel.h:1176:26: note: candidate function
double3 __OVERLOADABLE__ native_recip(double3);
^
cl_kernel.h:1178:26: note: candidate function
double4 __OVERLOADABLE__ native_recip(double4);
^
cl_kernel.h:1179:26: note: candidate function
double8 __OVERLOADABLE__ native_recip(double8);
^
cl_kernel.h:1180:27: note: candidate function
double16 __OVERLOADABLE__ native_recip(double16);
This compile error makes no sense to me: the argument and the assignment target are both half-precision floats, so the compiler should have no trouble selecting the matching overload.
With the Intel and AMD OpenCL implementations this kernel compiles just fine, but not with NVIDIA's OpenCL.
FULL_PROFILE OpenCL 3.0 CUDA 12.1.68 NVIDIA CUDA NVIDIA Corporation
NVIDIA GeForce RTX 3070 NVIDIA Corporation with [46 units] localmem=48KiB globalmem=7970MiB dims=3(1024x1024x64) max workgrp sz 1024
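For reference, a minimal sketch of two possible workarounds. The candidate list in the compiler output contains only float and double overloads of native_recip, which suggests this header declares no half variant. It may be that the half_ prefix refers to reduced accuracy on float arguments rather than to the half type, which is worth checking against the OpenCL C specification. The kernel name foo_workaround is hypothetical, and the second variant has not been verified on the NVIDIA platform listed above.

```c
#pragma OPENCL EXTENSION cl_khr_fp16 : enable

// Hypothetical kernel name; same buffers as the kernel in the question.
__kernel void foo_workaround(__global half* dst, __global const half* src)
{
    const uint index = get_global_id(0);

    // Variant 1: plain half-precision division, which is reported to
    // compile at the top of this post.
    dst[index] = 1.0h / src[index];

    // Variant 2 (assumption, untested here): cast the argument to float so
    // that overload resolution has an exact match among the listed
    // candidates, then narrow the result back to half.
    // dst[index] = (half)half_recip((float)src[index]);
}
```

Variant 1 keeps the arithmetic entirely in half precision. Variant 2 goes through float, so the result may differ slightly in rounding from a true half-precision reciprocal.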