Texture interpolation of packed fp16 (half2)

I’m able to store fp16 values in a CUDA texture (via a surface mapping the underlying array) and sample them back, as in the following snippet:

  __device__ void setTextureValue(const float4 value, int32_t x, int32_t y, int32_t z) {
    // Convert each float component to fp16 and store the raw 16-bit patterns.
    surf3Dwrite<ushort4>(
        make_ushort4(
            __half_as_ushort(__float2half_rn(value.x)),
            __half_as_ushort(__float2half_rn(value.y)),
            __half_as_ushort(__float2half_rn(value.z)),
            __half_as_ushort(__float2half_rn(value.w))),
        surface_,
        x * sizeof(ushort4),  // the x coordinate is a byte offset
        y,
        z);
  }

  __device__ float4 sampleTextureValue(float x, float y, float z) const {
    // The +0.5f offsets address texel centers (unnormalized coordinates).
    return tex3D<float4>(texture_, x + 0.5f, y + 0.5f, z + 0.5f);
  }
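
For context, the texture_ and surface_ objects are created over a cudaArray along the following lines (a simplified sketch with error checking omitted; the helper name makeHalf4Volume is just for illustration):

  #include <cuda_runtime.h>

  void makeHalf4Volume(cudaExtent extent,
                       cudaArray_t* array,
                       cudaTextureObject_t* texture,
                       cudaSurfaceObject_t* surface) {
    // 16:16:16:16 bits per texel, cudaChannelFormatKindFloat (fp16 channels).
    cudaChannelFormatDesc desc = cudaCreateChannelDescHalf4();
    cudaMalloc3DArray(array, &desc, extent, cudaArraySurfaceLoadStore);

    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = *array;

    cudaTextureDesc texDesc = {};
    texDesc.filterMode = cudaFilterModeLinear;   // hardware interpolation
    texDesc.readMode = cudaReadModeElementType;  // fp16 texels come back as float
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.addressMode[2] = cudaAddressModeClamp;
    texDesc.normalizedCoords = 0;

    cudaCreateTextureObject(texture, &resDesc, &texDesc, nullptr);
    cudaCreateSurfaceObject(surface, &resDesc);
  }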

This works, and makes use of the texture unit to perform hardware interpolation. However, I think it’s not as efficient as it could be. The first issue I see is that there is no way to return a __half4 value from the texture sampling, forcing conversion from the float4 back into a __half4 for further downstream processing. This is cheap enough that it’s not a huge deal (at the moment) though.
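
(The down-conversion can at least be done two components at a time with the packed conversion intrinsic from cuda_fp16.h; a minimal sketch, where toHalf2Pair is just an illustrative name:)

  #include <cuda_fp16.h>

  // Repack a sampled float4 into two __half2 values for fp16 processing.
  __device__ void toHalf2Pair(const float4 v, __half2& lo, __half2& hi) {
    lo = __floats2half2_rn(v.x, v.y);  // packs two floats into one __half2
    hi = __floats2half2_rn(v.z, v.w);
  }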

The second (arguably more substantial) issue is that I think potential compute is being wasted. In particular, Pascal introduced FP units that can process either one FP32 value or one __half2 value with a single instruction - i.e. each FPU can process 2 fp16 values at once. I’m not sure whether the same is true for whatever compute units perform the hardware interpolation, but assuming it is, what code would I need to write to take advantage of it during texture interpolation?
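
To illustrate the kind of packed arithmetic I mean on the compute side, a minimal example using the half2 intrinsics from cuda_fp16.h (this is the compute path, not the texture path):

  #include <cuda_fp16.h>

  // A single HFMA2 instruction performs a fused multiply-add on two fp16 lanes:
  // result = (a.x * b.x + c.x, a.y * b.y + c.y)
  __device__ __half2 packedFma(__half2 a, __half2 b, __half2 c) {
    return __hfma2(a, b, c);
  }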

Assuming interpolation is performed on 4x __half2 values, the return type would be either a (hypothetical) float8, assuming the upsampling to float needs to happen for output, or more ideally uint4, such that each underlying interpolated __half2 could be recovered by simply reinterpreting one of the 32-bit components, e.g. reinterpret_cast<const __half2&>(sampled_uint4.x) (and likewise for .y, .z, .w).

Hardware texture interpolation is performed in 1.8 fixed-point arithmetic, i.e. 9 bits of which 8 are fraction bits (see Appendix J of the CUDA Programming Guide), inside the multi-function unit (MUFU) that also computes certain low-accuracy single-precision math functions (in modern GPUs: MUFU.RCP, MUFU.RSQ, MUFU.SQRT, MUFU.EX2, MUFU.LG2, MUFU.SIN, MUFU.COS, MUFU.TANH, MUFU.RCP64H, MUFU.RSQ64H). For details see:

S. F. Oberman and M. Y. Siu, “A high-performance area-efficient multifunction interpolator,” in 17th IEEE Symposium on Computer Arithmetic (ARITH’05), 27-29 June 2005, pp. 272-279.

As long as the 256-level quantization of the hardware texture interpolator is sufficient for the use case, I do not think it can be beaten performance-wise.

In today’s world, various applications outside of classical graphics (e.g. some high-resolution CT scans) do require better accuracy than what the hardware texture interpolation can provide, and in those cases FMA-based interpolation in software is the way to go, e.g. using the formulas in Appendix J. How much performance degradation is incurred depends on the details of the use case, in particular on how memory-intensive it is. I seem to recall a factor 2-3 slowdown, but my memory is hazy.
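
For concreteness, below is a minimal sketch of what FMA-based trilinear interpolation could look like, following the formulas in Appendix J. It assumes a second texture object (pointTex, my name) over the same array, created with cudaFilterModePoint so that tex3D returns unfiltered texels:

  // Lerp each component with an FMA: a + t * (b - a).
  __device__ float4 lerp4(float4 a, float4 b, float t) {
    return make_float4(fmaf(t, b.x - a.x, a.x),
                       fmaf(t, b.y - a.y, a.y),
                       fmaf(t, b.z - a.z, a.z),
                       fmaf(t, b.w - a.w, a.w));
  }

  // Per Appendix J (unnormalized coordinates): x_B = x - 0.5,
  // i = floor(x_B), alpha = frac(x_B), and likewise for y and z.
  __device__ float4 trilinear(cudaTextureObject_t pointTex,
                              float x, float y, float z) {
    float xB = x - 0.5f, yB = y - 0.5f, zB = z - 0.5f;
    float fi = floorf(xB), fj = floorf(yB), fk = floorf(zB);
    float a = xB - fi, b = yB - fj, c = zB - fk;

    // Fetch the 8 surrounding texels (point sampling at texel centers).
    float4 t000 = tex3D<float4>(pointTex, fi + 0.5f, fj + 0.5f, fk + 0.5f);
    float4 t100 = tex3D<float4>(pointTex, fi + 1.5f, fj + 0.5f, fk + 0.5f);
    float4 t010 = tex3D<float4>(pointTex, fi + 0.5f, fj + 1.5f, fk + 0.5f);
    float4 t110 = tex3D<float4>(pointTex, fi + 1.5f, fj + 1.5f, fk + 0.5f);
    float4 t001 = tex3D<float4>(pointTex, fi + 0.5f, fj + 0.5f, fk + 1.5f);
    float4 t101 = tex3D<float4>(pointTex, fi + 1.5f, fj + 0.5f, fk + 1.5f);
    float4 t011 = tex3D<float4>(pointTex, fi + 0.5f, fj + 1.5f, fk + 1.5f);
    float4 t111 = tex3D<float4>(pointTex, fi + 1.5f, fj + 1.5f, fk + 1.5f);

    // Three levels of FMA-based lerps: first x, then y, then z.
    return lerp4(lerp4(lerp4(t000, t100, a), lerp4(t010, t110, a), b),
                 lerp4(lerp4(t001, t101, a), lerp4(t011, t111, a), b),
                 c);
  }

Note that the weights a, b, c here retain full fp32 precision, which is exactly the accuracy advantage over the hardware interpolator’s 8 fraction bits.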

If you decide to go down the path of coding up your own FMA-based interpolation, it would be very interesting to see a summary of the performance relative to use of the hardware interpolator posted in this thread.

This may be of interest.

When I look in /usr/local/cuda/include/channel_descriptor.h I see channel format descriptor helper functions for up to half4, so it seems reasonable to give that a try. I haven’t tried it myself and can’t speak to performance; I also don’t know of any available channel format descriptor or channel format kind for 4x half2.
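
For reference, the helpers I am referring to look like this (all return cudaChannelFormatKindFloat with 16-bit channels):

  #include <channel_descriptor.h>

  cudaChannelFormatDesc d1 = cudaCreateChannelDescHalf();   // 16:0:0:0
  cudaChannelFormatDesc d2 = cudaCreateChannelDescHalf2();  // 16:16:0:0
  cudaChannelFormatDesc d4 = cudaCreateChannelDescHalf4();  // 16:16:16:16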