If you look at Appendix D.2 in the programming guide, you can see how the interpolation is calculated. It notes that the interpolating fraction (the weight that goes from 0 to 1 between two texels) has only 8 bits of precision in its fractional part, so it moves in steps of 1/256, or about 0.4%. That is about the right size for the error you see.
There is no way to improve this precision using the hardware interpolator. If you want to do better, you will have to compute the interpolation yourself in the CUDA kernel; single-precision arithmetic will do much better than 0.1%.
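To get a feel for the size of the effect, here is a rough host-side sketch (plain C++, which nvcc also accepts) that compares a full single-precision lerp against one whose weight keeps only 8 fractional bits. The names lerpFloat, lerpQuantized and maxQuantizationError are made up for this illustration, and rounding the weight to the nearest 1/256 is an assumption; the hardware may truncate instead.

```cpp
#include <cmath>

// Full single-precision linear interpolation between a and b.
float lerpFloat(float a, float b, float t)
{
    return a + t * (b - a);
}

// Same interpolation, but the weight keeps only 8 fractional bits.
// ASSUMPTION: round-to-nearest; the real hardware may truncate instead.
float lerpQuantized(float a, float b, float t)
{
    float tq = std::round(t * 256.0f) / 256.0f;
    return a + tq * (b - a);
}

// Largest absolute difference between the two versions over
// steps + 1 evenly spaced weights in [0, 1].
float maxQuantizationError(float a, float b, int steps)
{
    float worst = 0.0f;
    for (int i = 0; i <= steps; ++i) {
        float t = static_cast<float>(i) / static_cast<float>(steps);
        float err = std::fabs(lerpFloat(a, b, t) - lerpQuantized(a, b, t));
        if (err > worst) {
            worst = err;
        }
    }
    return worst;
}
```

Under the rounding assumption, maxQuantizationError(0.0f, 1.0f, 10000) stays at or below 1/512 (about 0.2% of the difference between the two texels). With truncation the bound would be 1/256 instead.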
FYI, here’s a drop-in function to do a higher-precision interpolated texture lookup. It assumes the texture reference is set to non-normalized coordinates and point sampling (normalized = false, filterMode = cudaFilterModePoint).
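// T: texel type stored in the texture; R: interpolated return type (e.g. float or float4).
// R cannot be deduced, so name both at the call site: tex2DBilinear<unsigned char, float>(tex, x, y).
// lerp() is not a CUDA built-in; it comes from the SDK's cutil_math.h (helper_math.h in later samples).
template<class T, class R>
__device__
R tex2DBilinear(const texture<T, 2, cudaReadModeNormalizedFloat> tex, float x, float y)
{
    // shift so that texel centers fall on integer coordinates
    x -= 0.5f;
    y -= 0.5f;
    float px = floorf(x);   // integer position
    float py = floorf(y);
    float fx = x - px;      // fractional position
    float fy = y - py;
    // move back to texel centers for the point-sampled fetches
    px += 0.5f;
    py += 0.5f;

    return lerp( lerp( tex2D(tex, px, py),        tex2D(tex, px + 1.0f, py),        fx ),
                 lerp( tex2D(tex, px, py + 1.0f), tex2D(tex, px + 1.0f, py + 1.0f), fx ), fy );
}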
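If you want something to check the kernel's output against, the same coordinate arithmetic can be run on the host over a plain array. This is only a sketch: fetchClamped, lerpRef and bilinearRef are hypothetical names, the image is assumed to be row-major floats, and clamp-to-edge addressing is an assumption (your texture may be set to wrap instead).

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Point-sampled fetch from a row-major w x h image.
// ASSUMPTION: clamp-to-edge addressing. Texel i covers [i, i + 1),
// which is how non-normalized coordinates are laid out.
float fetchClamped(const std::vector<float>& img, int w, int h, float x, float y)
{
    int ix = std::min(std::max(static_cast<int>(std::floor(x)), 0), w - 1);
    int iy = std::min(std::max(static_cast<int>(std::floor(y)), 0), h - 1);
    return img[iy * w + ix];
}

// Single-precision linear interpolation.
float lerpRef(float a, float b, float t)
{
    return a + t * (b - a);
}

// Host-side mirror of the arithmetic in tex2DBilinear.
float bilinearRef(const std::vector<float>& img, int w, int h, float x, float y)
{
    x -= 0.5f;                  // texel centers now sit on integers
    y -= 0.5f;
    float px = std::floor(x);   // integer position
    float py = std::floor(y);
    float fx = x - px;          // fractional position
    float fy = y - py;
    px += 0.5f;                 // back to texel centers for the fetches
    py += 0.5f;

    float top    = lerpRef(fetchClamped(img, w, h, px, py),
                           fetchClamped(img, w, h, px + 1.0f, py), fx);
    float bottom = lerpRef(fetchClamped(img, w, h, px, py + 1.0f),
                           fetchClamped(img, w, h, px + 1.0f, py + 1.0f), fx);
    return lerpRef(top, bottom, fy);
}
```

For a 2x2 image holding 0, 1, 2, 3, sampling at (0.5, 0.5) lands exactly on the first texel center and returns 0, while sampling at (1.0, 1.0) sits midway between all four texels and returns 1.5.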