Half precision reciprocals in OpenCL

In my OpenCL code, 1.0h / value will work for half precision float values.

But when I use OpenCL’s built in function for that: a = half_recip(b) it will fail compilation, with a and b half precision floats.

The source:

#pragma OPENCL EXTENSION cl_khr_fp16 : enable

__kernel
void foo
(
	__global const half* dst,
	__global half* src
)
{
	const uint index = get_global_id(0);
	dst[index] = half_recip(src[index]);
}

The compiler output:

<kernel>:11:15: error: call to 'native_recip' is ambiguous
        dst[index] = half_recip(src[index]);
                     ^~~~~~~~~~~~~~~~~~~~~~
cl_kernel.h:1182:23: note: expanded from macro 'half_recip'
#define half_recip(x) native_recip(x)
                      ^~~~~~~~~~~~
cl_kernel.h:1164:24: note: candidate function
float __OVERLOADABLE__ native_recip(float);
                       ^
cl_kernel.h:1173:25: note: candidate function
double __OVERLOADABLE__ native_recip(double);
                        ^
cl_kernel.h:1165:25: note: candidate function
float2 __OVERLOADABLE__ native_recip(float2); 
                        ^
cl_kernel.h:1167:25: note: candidate function
float3 __OVERLOADABLE__ native_recip(float3); 
                        ^
cl_kernel.h:1169:25: note: candidate function
float4 __OVERLOADABLE__ native_recip(float4); 
                        ^
cl_kernel.h:1170:25: note: candidate function
float8 __OVERLOADABLE__ native_recip(float8); 
                        ^
cl_kernel.h:1171:26: note: candidate function
float16 __OVERLOADABLE__ native_recip(float16); 
                         ^
cl_kernel.h:1174:26: note: candidate function
double2 __OVERLOADABLE__ native_recip(double2); 
                         ^
cl_kernel.h:1176:26: note: candidate function
double3 __OVERLOADABLE__ native_recip(double3); 
                         ^
cl_kernel.h:1178:26: note: candidate function
double4 __OVERLOADABLE__ native_recip(double4); 
                         ^
cl_kernel.h:1179:26: note: candidate function
double8 __OVERLOADABLE__ native_recip(double8); 
                         ^
cl_kernel.h:1180:27: note: candidate function
double16 __OVERLOADABLE__ native_recip(double16); 

This compile error makes no sense: the argument and lvalue are both of type half-precision-float, so it should have no issue whatsoever to select the variant for the overloaded function.

Using Intel or AMD, this OpenCL kernel compiles just fine, but not when using nvidia’s OpenCL.

FULL_PROFILE OpenCL 3.0 CUDA 12.1.68 NVIDIA CUDA NVIDIA Corporation

NVIDIA GeForce RTX 3070 NVIDIA Corporation with [46 units] localmem=48KiB globalmem=7970MiB dims=3(1024x1024x64) max workgrp sz 1024

From what I have seen, NVIDIA doesn’t support the cl_khr_fp16 extension, but I haven’t checked lately (edit: see here). The compiler output here seems to suggest that. There is no indication of a candidate function that the compiler can find that takes a half argument (look through the list - its not there.) Therefore the compiler finds a number of possible alternatives (via type conversion) that it sees as equally usable, therefore the substitution is “ambiguous”. It makes sense to me, anyway, from what I see here.

I’m puzzled by the use of

const on dst but not src, although that doesn’t seem to be at issue here.

Oops… the const was a typo, when I tried to make a minimal case of my code.

I was confused about the candidate list, I think.

I assumed float16 was ‘half’ or ‘fp16’ but I guess I was wrong, and it is 16 float matrix instead?

I can happily feed it kernel source that uses ‘half’ though. Is that not what the extension does? Make half operations available?

float16 has the same syntax as double16, float8, float4 and float2. They are vector types.

I’m not an expert on what is or isn’t possible with NVIDIA OpenCL. The compiler seems to be making a rational statement to me, however. You’re welcome to file a bug if you think that half_recip should be usable (this way); it might be a bug, or it might just be unsupported.

Thank you.

I think the following is the case: FP16 operations in OpenCL are possible, both in Nvidia’s and AMD’s implementation.

But both report that they do not support cl_khr_fp16 when queried via clGetDeviceInfo CL_DEVICE_EXTENSIONS.

Intel reports cl_khr_fp16 supported.

I guess the support is partial on nvidia and amd, and therefore not advertised via the device info.

For completeness, these are the extension reported for the rtx3070:

cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_device_uuid cl_khr_pci_bus_info cl_khr_external_semaphore cl_khr_external_memory cl_khr_external_semaphore_opaque_fd cl_khr_external_memory_opaque_fd
1 Like

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.