I have a 3d grid of geographic points and I am performing interpolation between these.
My first attempt was to use the hardware's 3D linear filtering, but it turns out that the accuracy suffered too much.
So I have converted my kernel to perform 8 texture lookups (the corners of the cube) using POINT filtering, and
then I perform the standard bilinear interpolation in the kernel.
Unfortunately this reduces the throughput of my kernel from about 700 Mpixels/s to more like 250 Mpixels/s.
Any thoughts on how to improve this?
I am unsure whether the computation of the texture lookups is responsible for the slowdown. My guess would be the memory accesses.
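To make it concrete, the manual path looks roughly like this (just a sketch, not my actual code; the texture name "gridTex" and the texel-space coordinate convention are stand-ins for my real setup):

```
// Sketch: 8 point-filtered fetches at the cube corners, then a full-precision
// trilinear blend in the kernel. gridTex is assumed to be bound with
// cudaFilterModePoint and unnormalized coordinates.
texture<float, 3, cudaReadModeElementType> gridTex;

__device__ float lerpf(float a, float b, float t) { return a + t * (b - a); }

// p is the sample position in texel units.
__device__ float trilerpManual(float3 p)
{
    float x0 = floorf(p.x), y0 = floorf(p.y), z0 = floorf(p.z);
    float fx = p.x - x0,    fy = p.y - y0,    fz = p.z - z0;

    // The +0.5f centres the coordinate on the texel so point filtering
    // returns exactly that texel.
    float c000 = tex3D(gridTex, x0 + 0.5f, y0 + 0.5f, z0 + 0.5f);
    float c100 = tex3D(gridTex, x0 + 1.5f, y0 + 0.5f, z0 + 0.5f);
    float c010 = tex3D(gridTex, x0 + 0.5f, y0 + 1.5f, z0 + 0.5f);
    float c110 = tex3D(gridTex, x0 + 1.5f, y0 + 1.5f, z0 + 0.5f);
    float c001 = tex3D(gridTex, x0 + 0.5f, y0 + 0.5f, z0 + 1.5f);
    float c101 = tex3D(gridTex, x0 + 1.5f, y0 + 0.5f, z0 + 1.5f);
    float c011 = tex3D(gridTex, x0 + 0.5f, y0 + 1.5f, z0 + 1.5f);
    float c111 = tex3D(gridTex, x0 + 1.5f, y0 + 1.5f, z0 + 1.5f);

    // Blend along x, then y, then z.
    float c00 = lerpf(c000, c100, fx);
    float c10 = lerpf(c010, c110, fx);
    float c01 = lerpf(c001, c101, fx);
    float c11 = lerpf(c011, c111, fx);
    return lerpf(lerpf(c00, c10, fy), lerpf(c01, c11, fy), fz);
}
```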
I’m not clear on what you’re doing.
Texture interpolation IS (quantized) linear so going from software interpolation to hardware interpolation won’t help accuracy.
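If I remember the programming guide right, the fractional weight used by the filtering hardware is stored in 9-bit fixed point with 8 fractional bits, so effectively

\alpha_{hw} \approx \mathrm{round}(256\,\alpha) / 256,

which means the hardware result can differ from full-precision linear interpolation by up to roughly 1/512 of the texel-to-texel difference.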
What do you mean by texture lookups?
You’re not using a 3D texture?
Bilinear? That’s 2D.
1) Hardware linear texture filtering on (my) geographic grids is too inaccurate, but very fast.
2) Doing the linear filtering manually in the kernel works (accuracy-wise), but slows things down too much.
Any ideas what to do? Is 2) the optimal solution, or is there something better?
In 2) I am using (nearest-neighbor) texture lookups in the hope of getting some help from the texture cache.
I am using tex3D, as my texture is 3D.
Okay, not bilinear but, technically, 3-dimensional linear Lagrangian tensor-product interpolation (i.e. trilinear).
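Spelled out, for local coordinates (u, v, w) in [0,1]^3 within a grid cell and the eight corner samples f_{ijk}, what I mean is

f(u, v, w) \;=\; \sum_{i,j,k \in \{0,1\}} w_i(u)\, w_j(v)\, w_k(w)\, f_{ijk},
\qquad w_0(t) = 1 - t, \quad w_1(t) = t,

i.e. plain trilinear interpolation, just done in full float precision in the kernel.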
And, OK, I admit what I am really asking is whether bending over backwards to implement a kernel that prefetches into shared memory is likely to buy me a good speedup…
And open to any ideas I hadn’t thought about…
A fast 3D spline interpolation built on top of linear-filtering texture lookups can be used to get better accuracy.
And here’s a pretty sweet package for it as well.
It's hard to predict whether prefetching into shared memory will be faster or not. From a bandwidth point of view it should be a big win: the throughput of the texture units is only a fraction of the global memory bandwidth if you only have a few bytes per texel (although it's not so bad if you have, say, a float4 per texel). The downside is that you would have to manage the caching/prefetching yourself, and that will probably cost you instructions and registers.
If you can target Fermi then you can just read directly from global memory and rely on the L1 and L2 caches. You would probably want to use an appropriate layout for your data (some kind of 3D z-curve). This should be much faster than using nearest neighbour texture lookups.
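For what it's worth, the 3D z-curve index is just a bit-interleave of the three grid coordinates. Something along these lines would do it for grids up to 1024^3 (a generic sketch, not taken from any particular library):

```
// Spread the low 10 bits of v so that there are two zero bits between
// each original bit (standard Morton-code bit tricks).
__host__ __device__ unsigned int spreadBits10(unsigned int v)
{
    v &= 0x3FF;
    v = (v | (v << 16)) & 0x030000FF;
    v = (v | (v <<  8)) & 0x0300F00F;
    v = (v | (v <<  4)) & 0x030C30C3;
    v = (v | (v <<  2)) & 0x09249249;
    return v;
}

// 30-bit z-curve (Morton) index of voxel (x, y, z).
__host__ __device__ unsigned int mortonIndex3D(unsigned int x, unsigned int y, unsigned int z)
{
    return spreadBits10(x) | (spreadBits10(y) << 1) | (spreadBits10(z) << 2);
}
```

Storing the volume in that order keeps the 8 corners of a cell close together in memory, which is what the L1/L2 caches like.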
One other tip is that in HLSL (but not CUDA as far as I know) there is an instruction that fetches all 4 texels of a neighbourhood from a 2D image in a single sampling operation. I believe this is designed for exactly what you are trying to do (although only in 2D). Maybe some kind of DirectX interop would work for you.
And thanks fna, I think I understand what you suggest. My final approach is similar in spirit, I think:
I stumbled on a counterintuitive solution that brought back most of my throughput:
At the beginning of the kernel I perform the expensive manually filtered texture lookups and interpolations at the four corners of the geography that corresponds to my block, and store the results in shared memory.
After syncing up, each thread then performs a cheap arithmetic linear interpolation at its own point using those corner posts.
It's the same amount of computation (more, actually), and it is close to mathematically equivalent to doing the full interpolation at each point.
But the number of texture lookups is far smaller.
I guess the thing I am exploiting here is that my geographic grid is much coarser than my block, so a block will mostly lie within a single cell. For blocks that straddle cell boundaries I will get a slightly different result than the full interpolation (I could be cutting a corner), but it will likely work well enough!
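In a simplified form it looks something like this (reusing trilerpManual() and lerpf() from my earlier sketch; the corner-coordinate mapping, the single z slice and the flat 2D output are simplifications of what I actually do):

```
// invScale converts output-pixel coordinates into grid texel coordinates
// (my grid is much coarser than the output raster).
__global__ void interpKernel(float *out, int width, int height,
                             float invScale, float zTexel)
{
    __shared__ float post[2][2];   // interpolated values at the block's 4 corners

    // Four threads do the expensive 8-lookup interpolation, once per block.
    if (threadIdx.x < 2 && threadIdx.y < 2) {
        float cx = (blockIdx.x * blockDim.x + threadIdx.x * (blockDim.x - 1)) * invScale;
        float cy = (blockIdx.y * blockDim.y + threadIdx.y * (blockDim.y - 1)) * invScale;
        post[threadIdx.y][threadIdx.x] = trilerpManual(make_float3(cx, cy, zTexel));
    }
    __syncthreads();

    // Then every thread just blends the four corner posts arithmetically.
    float u = threadIdx.x / (float)(blockDim.x - 1);
    float v = threadIdx.y / (float)(blockDim.y - 1);
    float value = lerpf(lerpf(post[0][0], post[0][1], u),
                        lerpf(post[1][0], post[1][1], u), v);

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = value;
}
```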
My understanding is that you get the four neighbours for the price of one lookup instead of four. My guess is that we won't see a 3D version because of the amount of data that would have to be returned (up to 4 components from 8 different voxels - I don't think so somehow). But you'd probably get some decent results using a stack of 2D images.
It’s very interesting to hear that this feature is coming to CUDA. I wonder if there might be some other texturing improvements on the way too…