read half2 directly from 2D texture

Is there a way to read half2 values directly from a 2D texture with half precision data?

I am converting CUDA code to use half precision and would like to use:

half2 val = tex2D<half2>(tex, x, y);

instead of the current:

float2 val = tex2D<float2>(tex, x, y);

The half2 version does not compile. The documentation seems to indicate that this is not possible, but I wanted to make sure I'm not missing something. I would like to avoid converting to FP32 and then back to FP16.

The compile error:
/usr/local/cuda/include/texture_indirect_functions.h(262): error: no instance of overloaded function “tex2D” matches the argument list
argument types are: (half2 *, cudaTextureObject_t, float, float)
detected during:
instantiation of “T tex2D(cudaTextureObject_t, float, float) [with T=half2]”

Thanks, Troy.

The texture interpolator (if you intend to use it) always returns float/float2/float4.

You could try cudaReadModeElementType for point sampling, and maybe it’s possible to obtain the half2 as ushort2 (earlier versions of CUDA certainly supported it). Then apply the following casting function to the ushort2 vector components. It will not actually perform any costly conversion but simply reinterpret the ushort value as a half float.

__device__ __half __ushort_as_half(const unsigned short int i)
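Put together, the point-sampling route might look like the sketch below (assuming a texture object bound to 16-bit-per-channel data with cudaReadModeElementType; the function and parameter names are illustrative, not from the thread):

```cuda
#include <cuda_fp16.h>

// Point-sampled fetch (no hardware filtering): read the raw 16-bit
// pairs as ushort2, then reinterpret the bits as half values.
// No numeric conversion happens, only a bit-level reinterpretation.
__device__ half2 fetch_half2(cudaTextureObject_t tex, float x, float y)
{
    ushort2 raw = tex2D<ushort2>(tex, x, y);
    return __halves2half2(__ushort_as_half(raw.x),
                          __ushort_as_half(raw.y));
}
```

Note this only works with cudaFilterModePoint; with linear filtering enabled, the hardware interpolates and the raw-bits trick no longer applies.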



The binary type should be ushort or ushort2

Note that you can derive from the '__half' struct in order to gain access to the protected member variable '__x' (I hope this does not get changed in future CUDA Toolkits).

Thank you Christian and HannesF99 for the information. Unfortunately I do need to use interpolation during the texture read so the ushort approaches will not help.

It appears that reading out a half2 from a texture with interpolation is not currently supported. The CUDA documentation could use some improvement in this area, and with half precision in general.


You can easily implement the bilinear interpolation (and also the border-mode handling) yourself instead of relying on the texture hardware. Unless your kernel is really compute bound (which is usually not the case), the few additional arithmetic operations will not noticeably affect the kernel runtime.
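A minimal sketch of that suggestion, assuming row-major half2 data in global memory and clamp-to-edge border handling (the function name, layout, and parameters are illustrative assumptions):

```cuda
#include <cuda_fp16.h>

// Manual bilinear interpolation of half2 data, done in FP32.
// Mimics the texture convention of sample centers at integer + 0.5
// and cudaAddressModeClamp border handling.
__device__ half2 bilerp_half2(const half2* src, int width, int height,
                              float x, float y)
{
    float fx = x - 0.5f, fy = y - 0.5f;
    int   x0 = __float2int_rd(fx), y0 = __float2int_rd(fy);
    float wx = fx - (float)x0,     wy = fy - (float)y0;

    // Clamp the four tap coordinates to the image edges
    int x0c = min(max(x0,     0), width  - 1);
    int x1c = min(max(x0 + 1, 0), width  - 1);
    int y0c = min(max(y0,     0), height - 1);
    int y1c = min(max(y0 + 1, 0), height - 1);

    // Widen the four taps to FP32, interpolate, round back to FP16
    float2 p00 = __half22float2(src[y0c * width + x0c]);
    float2 p10 = __half22float2(src[y0c * width + x1c]);
    float2 p01 = __half22float2(src[y1c * width + x0c]);
    float2 p11 = __half22float2(src[y1c * width + x1c]);

    float2 r;
    r.x = (1.0f - wy) * ((1.0f - wx) * p00.x + wx * p10.x)
        +         wy  * ((1.0f - wx) * p01.x + wx * p11.x);
    r.y = (1.0f - wy) * ((1.0f - wx) * p00.y + wx * p10.y)
        +         wy  * ((1.0f - wx) * p01.y + wx * p11.y);
    return __float22half2_rn(r);
}
```

Because the weights wx/wy are full FP32 here, this also delivers better interpolation quality than the texture unit's fixed-point weights.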

Based on my experience with various use cases, I will boldly claim that this should be the default approach these days. On modern GPUs, FP32 floating-point operations are “too cheap to meter”, and the quality of the interpolation is much better when done with FP32 versus the 9-bit (that is, 1.8) fixed-point arithmetic utilized by the texture units.

Only where this approach is not fast enough AND the quality degradation from use of hardware interpolation is acceptable should interpolation via the texture units be chosen.

To follow up on my original question and the suggestions made: I implemented the linear interpolation in device code using half-precision intrinsics to replace the 2D texture read, but I observed significantly longer kernel runtimes. The interpolation itself is very simple, a subtraction and a fused multiply-add, but properly handling the border/edge conditions makes the algorithm more complicated. Also, in my case the input data does not fit in constant memory or shared memory, forcing the interpolation to read from global memory. So it appears that, in my case, using half2 val = __float22half2_rn(tex2D<float2>(tex, x, y)) is the best solution for interpolating half2 data.
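For completeness, a sketch of that final approach, assuming a cudaArray created with cudaCreateChannelDescHalf2() (all names here are illustrative; error checking is omitted):

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// The texture hardware filters the FP16x2 data and returns FP32,
// which we round back to half2 with a single intrinsic.
__global__ void interp_kernel(cudaTextureObject_t tex, half2* out,
                              float x, float y)
{
    float2 f = tex2D<float2>(tex, x, y);   // hardware bilinear, FP32 result
    out[blockIdx.x * blockDim.x + threadIdx.x] = __float22half2_rn(f);
}

// Host-side texture object setup for a half2 cudaArray
cudaTextureObject_t make_half2_texture(cudaArray_t arr)
{
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = arr;  // array from cudaCreateChannelDescHalf2()

    cudaTextureDesc texDesc = {};
    texDesc.filterMode     = cudaFilterModeLinear;   // enable interpolation
    texDesc.readMode       = cudaReadModeElementType; // FP16 still reads as float
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);
    return tex;
}
```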