Trilinear Interpolation without Textures

Hey guys,

I’m currently continuing to develop a protein-protein docking program running on the GPU in course of my masterthesis (this is the repository if you are interested: https://github.com/UmbrellaSampler/gpuATTRACT_2.0)

Background:
In this program the force gradients and energies are precalculated and stored in a grid. During the simulation this grid is loaded as a 3D-texture and the force gradients/energies are interpolated between the grid-points.
So far this was done via the hardware accelerated interpolation of neighboring gridpoints in the texture. However it turns out that this results in an error of about 1.8 promille, which is to high.
After switching to software interpolation the error went down significantly but the runtime increased by a factor of ~4, which makes me sad - alot!

Questions:
Which is the fastest way to do software based interpolation on a 3D grid:
Since my guess is that the bottleneck is memory access-> How does the memory have to be allocated to ensure the fastest access to neighboring gridpoints?
Is it faster to use global memory of 3Dcudaarrays ( if there is any difference at all)?
How do I implement trilinear interpolation in the best possible way?

I’ve tried to do it using fma(https://devblogs.nvidia.com/lerp-faster-cuda/) which actually turns out to be a little bit slower than using conventional lerp.

This is my current implementation:

(the filter mode of grid.texArrayPt is cudaFilterModePoint)

template
host device
forceinline T lerp(T v0, T v1, T t) {
return t*v0+(1-t)*v1;
}

host device
forceinline float4 lerp4f(float4 v0, float4 v1, float t) {
return make_float4( lerp(v0.x, v1.x, t), lerp(v0.y, v1.y, t), lerp(v0.z, v1.z, t), lerp(v0.w, v1.w, t) );
}

template
host device
forceinline float4 interpolate( const d_IntrlpGrid& grid,unsigned const type, REAL x, REAL y, REAL z, unsigned const i){

x = (x - grid.minDim.x) * grid.dVox_inv;
y = (y - grid.minDim.y) * grid.dVox_inv;
z = (z - grid.minDim.z) * grid.dVox_inv;

unsigned const idxX = (unsigned) floor(x);
unsigned const idxY = (unsigned) floor(y);
unsigned const idxZ = (unsigned) floor(z);

REAL const a = x - (REAL)idxX;
REAL const b = y - (REAL)idxY;
REAL const c = z - (REAL)idxZ;
float4 data[2][2][2];
data[0][0][0] = tex3D<float4>(grid.texArrayPt[type], idxX, idxY, idxZ);
data[0][0][1] = tex3D<float4>(grid.texArrayPt[type], idxX, idxY, idxZ + 1);
data[0][1][1] = tex3D<float4>(grid.texArrayPt[type], idxX, idxY + 1, idxZ + 1);
data[0][1][0] = tex3D<float4>(grid.texArrayPt[type], idxX, idxY + 1, idxZ);
data[1][1][0] = tex3D<float4>(grid.texArrayPt[type], idxX + 1, idxY + 1, idxZ);
data[1][1][1] = tex3D<float4>(grid.texArrayPt[type], idxX + 1, idxY + 1, idxZ + 1);
data[1][0][1] = tex3D<float4>(grid.texArrayPt[type], idxX + 1, idxY, idxZ + 1);
data[1][0][0] = tex3D<float4>(grid.texArrayPt[type], idxX + 1, idxY, idxZ);

float4 result =	lerp4f(
				lerp4f(
					lerp4f(data[0][0][0],data[0][0][1],c),
					lerp4f(data[0][1][0],data[0][1][1],c),
					b),
				lerp4f(
					lerp4f(data[1][0][0],data[1][0][1],c),
					lerp4f(data[1][1][0],data[1][1][1],c),
					b),
				a);
return result;

}

I’d be really grateful for any suggestions or existing implementations and tips:) thanks in advance and all the best,
Glenn

You could also attempt to use a layered 2D array instead of a 3D texture. Perform bilinear interpolation in each layer first then do a final lerp between the two layer results.

Christian

For your basic lerp, I would suggest using one based on just two FMAs:

https://devblogs.nvidia.com/lerp-faster-cuda/

The accuracy you observe with the hardware interpolation is as expected, because the hardware uses low-precision 1.8 bit fixed-point arithmetic for the interpolation, as detailed in an appendix of the CUDA Programmer Guide.

Thank you guys for the fast responds!

@Christian: that sounds like a good idea!

@njuffa: I already implemented the lerp with two fma’s. It turned out to be slower, as I already mentioned. I also entered both function versions in godbolt and it seems that the fma-version has more lines of PTX code.

Glenn

Sorry, I overlooked that particular line of your original post. I have to say I am surprised by that, as I cannot think of any arrangement of the lerp math that results in fewer than two FP32 operations. There may be secondary scheduling or register-pressure effects.

Note that PTX code should under almost no circumstances be used in code efficiency analysis, as PTX is merely an intermediate code representation that is compiled into machine code (SASS) by an optimizing compiler (ptxas) at the back end of the CUDA tool chain. Therefore code analysis should look at SASS, e.g. from running cuobjdump --dump-sass on the executable.

I haven’t worked through the details of trilinear interpolation, but a basic idea to reduce the number of memory accesses (in addition to cbuncher’s layering suggestion) would be to handle the 2D interpolation for several neighboring points at once, so neighbor data does not have to be loaded redundantly. This could involve retrieving entire tiles and storing them in shared memory.

[Later:] An approach that should be very efficient in terms of memory accesses would, at any moment in time, cache three layers of a 2D (x,y) tile in z direction, perform the necessary interpolations, then discard the bottom-most layer while retrieving a new top layer. This cyclic buffer approach is explained in the following paper:

Everett H. Phillips and Massimiliano Fatica, “Implementing the Himeno Benchmark with CUDA on GPU Clusters.” In 2010 IEEE International Symposium on Parallel & Distributed Processing, pp. 1-10

I was surprised myself, since the post is explicitly stating that it’s faster. Thanks, I was already half way curious how runtime and PTX-length correlate.

Concerning shared memory:
The problem that unfortunately exists is that each thread is assigned to an atom and calculates the forces (gridbased) that are acting on it. This means that concurrent threads are likely to handle atoms that close in space, but that not a must. Thus I cannot easily load the gridpoints into shared memory and interpolate between neighboring gridpoints.
To do this I would have to assign each thread to each gridpoint and cluster all atoms in this region, I guess.

But maybe you are right and I could just load each gridpoint needed into shared memory, hoping that atoms share neighbors in the grid.

I’ll take a look at the paper tomorrow. I’m done for today ;)

all the best from munich
Glenn

In many situations data can be re-arranged efficiently during the move from global to shared memory. I don’t have sufficient insight into this use case to say whether that is possible here. You may also want to look into publications that deal with similar use cases; maybe particle-in-cell scenarios would be of interest?

Make sure to use the CUDA profiler to guide the memory optimization process, which will likely be a multi-stage affair.

I have read the link, but should fma(t, v1 -v0, v0) be better or at least equivalent?

(1) Less accurate
(2) Not faster (still two dependent floating-point instructions)

I looked at my framework that I used to evaluate different styles of lerp, and found that I evaluated the variant proposed in #8 along with others prior to the blog post I linked to above.

To illustrate the accuracy differences, I ran a largish number of single-precision test cases through the two lerp versions. Here are the worst-case test vectors reported:

Feng_lerp:  v0= 6.01511896e-001  v1= 1.35239959e-003  t= 9.99875069e-001  ulp=256.364
Blog_lerp:  v0= 1.69147059e-001  v1= 1.69155478e-001  t= 4.40470593e-008  ulp=0.999975

The maximum error of the lerp version from the blog is generally < 1 ulp, i.e. it produces faithfully rounded results. It is not monotonic, however.

Thanks, I understand now.