Simulate tex2D fetch: what is the implementation of linear interpolation?

I am trying to migrate some code that uses a tex2D fetch from a texture bound to a cudaArray (itself filled from a simple 2D float array).
The fetch can occur at non-integer coordinates (float addressing).

I want to get rid of the cudaArray/texture code layers and fetch directly from the original input array.
To get the same results, I therefore need to perform the same 2D linear interpolation.

Currently, I don’t get the same results, so I must be doing something wrong in my implementation, which is inspired by https://on-demand.gputechconf.com/gtc-express/2011/presentations/texture_webinar_aug_2011.pdf

Here is my implementation. Is there an obvious error? Does the tex2D linear interpolation work differently?

float2 coords = ...
//add .5f to the coordinates to address pixel "centers", and clamp
coords.x = max(0.f, min(coords.x + 0.5f, imSize.x - 1.f)); //imSize.x is the width of the original 2D array
coords.y = max(0.f, min(coords.y + 0.5f, imSize.y - 1.f)); //imSize.y is the height of the original 2D array

//compute floor and ceil of the coordinates; for now I don't try to
//reproduce an 8-bit quantization of the fractional part
const float xf = floorf(coords.x);
const float xc = min(ceilf(coords.x), imSize.x - 1.f);
const float yf = floorf(coords.y);
const float yc = min(ceilf(coords.y), imSize.y - 1.f);

//compute the blending factors
const float tx = coords.x - xf;
const float ty = coords.y - yf;
const float one = 1.f;

//blend on x (imSize.z is the image stride in pixels)
const float p_x[2] = {
    (one - tx) * my2DArray[(int)yf * imSize.z + (int)xf] + tx * my2DArray[(int)yf * imSize.z + (int)xc],
    (one - tx) * my2DArray[(int)yc * imSize.z + (int)xf] + tx * my2DArray[(int)yc * imSize.z + (int)xc]
};
//blend on y
const float p = (one - ty) * p_x[0] + ty * p_x[1];

[I did not review the posted code for errors; maybe other readers will volunteer for that]

The details of the hardware texture interpolation are documented in appendix J of the CUDA C++ Programming Guide, specifically section J.2:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#linear-filtering

As you can see there, the computation is performed in 9-bit fixed-point arithmetic with 8 fraction bits. Trying to simulate that exactly, so that results are bitwise identical to the hardware interpolation, will likely be computationally expensive. A common reason to switch to software interpolation is to increase the accuracy of the interpolation beyond what is afforded by the 8-bit quantization of the hardware interpolation.
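If one merely wants to get close to the hardware results, one cheap approximation is to snap the blending factors to the 1/256 grid before interpolating. A minimal sketch (the rounding mode the texture unit applies is not documented, so round-to-nearest here is an assumption):

__device__ float quantize_frac(float t)
{
    //9-bit fixed point with 8 fraction bits: steps of 1/256
    //(round-to-nearest assumed; the hardware rounding is not documented)
    return floorf(t * 256.f + 0.5f) * (1.f / 256.f);
}

In the posted code, tx and ty would then be passed through such a function before the blends.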

The following NVIDIA blog post shows how to lerp accurately and efficiently with FP32 computation:

https://developer.nvidia.com/blog/lerp-faster-cuda/

The basic idea there is that when FMA (fused multiply-add) is available as a basic hardware operation, it is advantageous to transform a computation of the form (1 - t) * a into fma(-t, a, a).
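Concretely, a single-precision lerp along those lines is just two FMAs (using fmaf here):

__device__ __forceinline__ float lerp(float v0, float v1, float t)
{
    //computes (1 - t) * v0 + t * v1 as two fused multiply-adds
    return fmaf(t, v1, fmaf(-t, v0, v0));
}

A 2D bilinear fetch is then three such calls: two lerps along x followed by one lerp along y.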

The details of the hardware texture interpolation are documented in appendix J of the CUDA C++ Programming Guide, specifically section J.2:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#linear-filtering

Exactly what I was looking for, but had not found by myself. Thanks for pointing it out.

As you can see there, the computation is performed in 9-bit fixed-point arithmetic with 8 fraction bits. Trying to simulate that exactly, so that results are bitwise identical to the hardware interpolation, will likely be computationally expensive.

Agree: that’s why I did not try to do that in the first place. Actually, I did not want bit-exact results, just comparable ones.

The doc you mentioned just tells me that my code should work. So there must be something else, perhaps a bug in the original implementation with the bound cudaArray. I will look for stupid things like a y flip, for instance.
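For reference, here is a direct transcription of the documented computation as I understand it; just a sketch (the function name and the explicit width/height/stride parameters are mine, and I assume unnormalized coordinates with clamp addressing, as in my original code):

//tex2D linear filtering per section J.2 (without the 8-bit quantization):
//tex(x,y) = (1-a)(1-b)T[i,j] + a(1-b)T[i+1,j] + (1-a)b T[i,j+1] + ab T[i+1,j+1]
//with xB = x - 0.5f, i = floorf(xB), a = xB - i (and likewise yB, j, b)
__device__ float fetch2D_emulated(const float* tex, int width, int height, int stride,
                                  float x, float y)
{
    const float xB = x - 0.5f; //note the documented -0.5f shift to texel centers
    const float yB = y - 0.5f;
    const float fi = floorf(xB);
    const float fj = floorf(yB);
    const float a = xB - fi;
    const float b = yB - fj;
    //clamp the four sample positions to the image borders
    const int x0 = min(max((int)fi, 0), width - 1);
    const int x1 = min(max((int)fi + 1, 0), width - 1);
    const int y0 = min(max((int)fj, 0), height - 1);
    const int y1 = min(max((int)fj + 1, 0), height - 1);
    //blend on x, then on y
    const float p0 = (1.f - a) * tex[y0 * stride + x0] + a * tex[y0 * stride + x1];
    const float p1 = (1.f - a) * tex[y1 * stride + x0] + a * tex[y1 * stride + x1];
    return (1.f - b) * p0 + b * p1;
}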
