I have a 3d grid of geographic points and I am performing interpolation between these.
My first attempt was to use the hardware's 3D linear filtering, but it turns out that the accuracy suffered too much.
So I have converted my kernel to perform 8 texture lookups (the corners of the cube) using POINT filtering, and
then I perform the standard bilinear interpolation in the kernel.
Unfortunately this reduces the throughput of my kernel from about 700 Mpixels/s to more like 250 Mpixels/s.
Any thoughts on how to improve this?
I am unsure whether the computation of the texture lookups is responsible for the slowdown. My guess would be the memory accesses.
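To make it concrete, the manual path looks roughly like this (just a sketch, not my actual code; the texture name "gridTex" and the texel-space coordinate convention are stand-ins for my real setup):

```
// Sketch: 8 point-filtered fetches at the cube corners, then a full-precision
// trilinear blend in the kernel. gridTex is assumed to be bound with
// cudaFilterModePoint and unnormalized coordinates.
texture<float, 3, cudaReadModeElementType> gridTex;

__device__ float lerpf(float a, float b, float t) { return a + t * (b - a); }

// p is the sample position in texel units.
__device__ float trilerpManual(float3 p)
{
    float x0 = floorf(p.x), y0 = floorf(p.y), z0 = floorf(p.z);
    float fx = p.x - x0,    fy = p.y - y0,    fz = p.z - z0;

    // The +0.5f centres the coordinate on the texel so point filtering
    // returns exactly that texel.
    float c000 = tex3D(gridTex, x0 + 0.5f, y0 + 0.5f, z0 + 0.5f);
    float c100 = tex3D(gridTex, x0 + 1.5f, y0 + 0.5f, z0 + 0.5f);
    float c010 = tex3D(gridTex, x0 + 0.5f, y0 + 1.5f, z0 + 0.5f);
    float c110 = tex3D(gridTex, x0 + 1.5f, y0 + 1.5f, z0 + 0.5f);
    float c001 = tex3D(gridTex, x0 + 0.5f, y0 + 0.5f, z0 + 1.5f);
    float c101 = tex3D(gridTex, x0 + 1.5f, y0 + 0.5f, z0 + 1.5f);
    float c011 = tex3D(gridTex, x0 + 0.5f, y0 + 1.5f, z0 + 1.5f);
    float c111 = tex3D(gridTex, x0 + 1.5f, y0 + 1.5f, z0 + 1.5f);

    // Blend along x, then y, then z.
    float c00 = lerpf(c000, c100, fx);
    float c10 = lerpf(c010, c110, fx);
    float c01 = lerpf(c001, c101, fx);
    float c11 = lerpf(c011, c111, fx);
    return lerpf(lerpf(c00, c10, fy), lerpf(c01, c11, fy), fz);
}
```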
I’m not clear on what you’re doing.
Texture interpolation IS (quantized) linear so going from software interpolation to hardware interpolation won’t help accuracy.
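If I remember the programming guide right, the fractional weight used by the filtering hardware is stored in 9-bit fixed point with 8 fractional bits, so effectively

\alpha_{hw} \approx \mathrm{round}(256\,\alpha) / 256,

which means the hardware result can differ from full-precision linear interpolation by up to roughly 1/512 of the texel-to-texel difference.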
What do you mean by texture lookups?
You’re not using a 3D texture?
Bilinear? That’s 2D.
1) Hardware linear texture filtering on (my) geographic grids is too inaccurate, but very fast.
2) Doing the linear filtering manually in the kernel works (accuracy-wise), but slows things down too much.
Any ideas what to do? Is 2) the optimal solution, or is there something better?
In 2) I am using (nearest-neighbor) texture lookups in the hope of getting some help from the texture cache.
I am using tex3D, as my texture is 3D.
Okay, not bilinear but, technically, 3-dimensional linear Lagrangian tensor-product interpolation (i.e. trilinear).
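Spelled out, for local coordinates (u, v, w) in [0,1]^3 within a grid cell and the eight corner samples f_{ijk}, what I mean is

f(u, v, w) \;=\; \sum_{i,j,k \in \{0,1\}} w_i(u)\, w_j(v)\, w_k(w)\, f_{ijk},
\qquad w_0(t) = 1 - t, \quad w_1(t) = t,

i.e. plain trilinear interpolation, just done in full float precision in the kernel.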
And, OK, I admit what I am really asking is whether bending over backwards to implement a kernel that prefetches into shared memory is likely to buy me a good speedup…
And open to any ideas I hadn’t thought about…
A fast 3D spline interpolation built on top of linear-filtering texture lookups can be used to get better accuracy.
And here’s a pretty sweet package for it as well.
It's hard to predict whether prefetching into shared memory will be faster or not. From a bandwidth point of view it should be a big win: the throughput of the texture units is only a fraction of the global memory bandwidth if you only have a few bytes per texel (although it's not so bad if you have, say, a float4 per texel). The downside is that you would have to manage the caching/prefetching yourself, and that will probably cost you instructions and registers.
If you can target Fermi then you can just read directly from global memory and rely on the L1 and L2 caches. You would probably want to use an appropriate layout for your data (some kind of 3D z-curve). This should be much faster than using nearest neighbour texture lookups.
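For what it's worth, the 3D z-curve index is just a bit-interleave of the three grid coordinates. Something along these lines would do it for grids up to 1024^3 (a generic sketch, not taken from any particular library):

```
// Spread the low 10 bits of v so that there are two zero bits between
// each original bit (standard Morton-code bit tricks).
__host__ __device__ unsigned int spreadBits10(unsigned int v)
{
    v &= 0x3FF;
    v = (v | (v << 16)) & 0x030000FF;
    v = (v | (v <<  8)) & 0x0300F00F;
    v = (v | (v <<  4)) & 0x030C30C3;
    v = (v | (v <<  2)) & 0x09249249;
    return v;
}

// 30-bit z-curve (Morton) index of voxel (x, y, z).
__host__ __device__ unsigned int mortonIndex3D(unsigned int x, unsigned int y, unsigned int z)
{
    return spreadBits10(x) | (spreadBits10(y) << 1) | (spreadBits10(z) << 2);
}
```

Storing the volume in that order keeps the 8 corners of a cell close together in memory, which is what the L1/L2 caches like.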
One other tip is that in HLSL (but not CUDA as far as I know) there is an instruction that fetches all 4 texels of a neighbourhood from a 2D image in a single sampling operation. I believe this is designed for exactly what you are trying to do (although only in 2D). Maybe some kind of DirectX interop would work for you.
And thanks fna, I think I understand what you suggest. My final approach is similar in spirit, I think:
I stumbled on a counterintuitive solution that brought back most of my throughput:
At the beginning of the kernel I perform the expensive manually filtered texture lookups and interpolations at the four corners of the geography that corresponds to my block, and store the results in shared memory.
After syncing up, each thread then performs a cheap arithmetic linear interpolation at its own point using those corner posts.
It's the same amount of computation (more, actually), and it is close to mathematically equivalent to doing the full interpolation at each point.
But the number of texture lookups is far smaller.
I guess the thing I am exploiting here is that my geographic grid is much coarser than my block, so a block will mostly lie within a single cell. For blocks that straddle cell boundaries I will get a slightly different result than the full interpolation (I could be cutting a corner), but it will likely work well enough!
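In a simplified form it looks something like this (reusing trilerpManual() and lerpf() from my earlier sketch; the corner-coordinate mapping, the single z slice and the flat 2D output are simplifications of what I actually do):

```
// invScale converts output-pixel coordinates into grid texel coordinates
// (my grid is much coarser than the output raster).
__global__ void interpKernel(float *out, int width, int height,
                             float invScale, float zTexel)
{
    __shared__ float post[2][2];   // interpolated values at the block's 4 corners

    // Four threads do the expensive 8-lookup interpolation, once per block.
    if (threadIdx.x < 2 && threadIdx.y < 2) {
        float cx = (blockIdx.x * blockDim.x + threadIdx.x * (blockDim.x - 1)) * invScale;
        float cy = (blockIdx.y * blockDim.y + threadIdx.y * (blockDim.y - 1)) * invScale;
        post[threadIdx.y][threadIdx.x] = trilerpManual(make_float3(cx, cy, zTexel));
    }
    __syncthreads();

    // Then every thread just blends the four corner posts arithmetically.
    float u = threadIdx.x / (float)(blockDim.x - 1);
    float v = threadIdx.y / (float)(blockDim.y - 1);
    float value = lerpf(lerpf(post[0][0], post[0][1], u),
                        lerpf(post[1][0], post[1][1], u), v);

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = value;
}
```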
My understanding is that you get the four neighbours for the price of one lookup instead of four. My guess is that we won't see a 3D version because of the amount of data that would have to be returned (up to 4 components from 8 different voxels - I don't think so somehow). But you'd probably get some decent results using a stack of 2D images.
It’s very interesting to hear that this feature is coming to CUDA. I wonder if there might be some other texturing improvements on the way too…