I am trying to migrate some code that is using a tex2D fetch from a texture bound to a cudaArray over a simple 2D float array.
The fetch can occur with non-integer coordinates (float addressing).

I want to get rid of the cudaArray/tex code layers and directly fetch from the original input array.
Thus, to get the same results, I need to perform a similar 2D linear interpolation.

Here is my implementation. Is there an obvious error ? Does the fetch2D linear interpolation work differently ?

float2 coords = ...
//add .5f to coordinates to use pixel "centers", and clamp
coords.x = max(0.f, min(coords.x+0.5f, imSize.x-1.f));//imSize.x is the width of the original 2D array
coords.y = max(0.f, min(coords.y+0.5f, imSize.y-1.f));//imSize.y is the height of the original 2D array
//compute floor and ceil of coordinates, for now I don't try to use a 8-bit quantization of the fractional part
const float xf = floorf(coords.x);
const float xc = min(ceilf(coords.x), imSize.x-1.f);
const float yf = floorf(coords.y);
const float yc = min(ceilf(coords.y), imSize.y-1.f);
//compute blending factors
const float tx = (coords.x-xf);
const float ty = (coords.y-yf);
const float one = 1;
//blend on x. (imSize.z is image stride in pixels)
const float p_x[2] = {(one-tx)*my2DArray[(int)yf*imSize.z+(int)xf]+tx*my2DArray[(int)yf*imSize.z+(int)xc],
(one-tx)*my2DArray[(int)yc*imSize.z+(int)xf]+tx*my2DArray[(int)yc*imSize.z+(int)xc]};
//blend on y
const float p = (one-ty)*p_x[0]+ty*p_x[1];

As you can see there, the computation is performed in 9-bit fixed-point arithmetic with 8 fraction bits. Trying to simulate that exactly so results are bitwise identical to hardware interpolation will likely be computationally expensive. A common reason to switch to software interpolation is to increase the accuracy of the interpolation beyond what is afforded by the 8-bit quantization of the hardware interpolation.

The following NVIDIA blog post shows how to lerp accurately and efficiently with FP32 computation:

The basic idea there is that when FMA (fused multiply-add) is available as a basic hardware operations, it is advantageous to transform computation of the form (1 - t) * a into fma (-t, a, a).

Exactly what I was looking for, but did not found by myself. Thanks for pointing it out.

As you can see there, the computation is performed in 9-bit fixed-point arithmetic with 8 fraction bits. Trying to simulate that exactly so results are bitwise identical to hardware interpolation will likely be computationally expensive

Agree : that’s why I did not try to to that in the first place, actually I did not want bit-exact result but comparable results

The doc you mentioned just tells me that my code should work. So there must be something else, perhaps a bug in the original implementation with the bound cudaArray. I will look for stupid things like an y flip, for instance.