Hey guys,
I’m continuing the development of a protein-protein docking program running on the GPU as part of my master’s thesis (this is the repository if you are interested: GitHub - UmbrellaSampler/gpuATTRACT_2.0).
Background:
In this program the force gradients and energies are precalculated and stored in a grid. During the simulation this grid is loaded as a 3D texture, and the force gradients/energies are interpolated between the grid points.
So far this was done via the hardware-accelerated interpolation of neighboring grid points in the texture. However, it turns out that this introduces an error of about 1.8 per mille, which is too high. (This is plausible, since the texture hardware computes its interpolation weights in 9-bit fixed point with only 8 fractional bits.)
After switching to software interpolation the error went down significantly, but the runtime increased by a factor of ~4, which makes me sad. A lot!
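For context, the hardware path was essentially a single filtered fetch per lookup, roughly like this (a minimal sketch, assuming a texture object created with cudaFilterModeLinear and unnormalized coordinates; the function name is illustrative):

__device__ float4 interpolateHardware(cudaTextureObject_t tex, float x, float y, float z) {
    // With cudaFilterModeLinear the texture unit performs the trilinear blend itself,
    // but only with 9-bit fixed-point interpolation weights.
    // The +0.5f offset shifts from grid indices to texel-center coordinates.
    return tex3D<float4>(tex, x + 0.5f, y + 0.5f, z + 0.5f);
}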
Questions:
What is the fastest way to do software-based interpolation on a 3D grid?
My guess is that the bottleneck is memory access, so: how does the memory have to be allocated to ensure the fastest access to neighboring grid points?
Is it faster to use global memory or 3D CUDA arrays (if there is any difference at all)? For reference, a sketch of my current allocation is below the questions.
How do I implement trilinear interpolation in the best possible way?
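The grids currently live in 3D CUDA arrays and are read through point-filtered texture objects, roughly like this (a condensed sketch, not the exact code from the repository; dimX/dimY/dimZ and hostData are placeholders, and error checking is omitted):

// One 3D CUDA array per grid type, read through a texture object.
cudaChannelFormatDesc desc = cudaCreateChannelDesc<float4>();
cudaExtent extent = make_cudaExtent(dimX, dimY, dimZ);

cudaArray_t array;
cudaMalloc3DArray(&array, &desc, extent);

// Copy the precalculated grid to the device.
cudaMemcpy3DParms copyParams = {0};
copyParams.srcPtr = make_cudaPitchedPtr(hostData, dimX * sizeof(float4), dimX, dimY);
copyParams.dstArray = array;
copyParams.extent = extent;
copyParams.kind = cudaMemcpyHostToDevice;
cudaMemcpy3D(&copyParams);

cudaResourceDesc resDesc = {};
resDesc.resType = cudaResourceTypeArray;
resDesc.res.array.array = array;

cudaTextureDesc texDesc = {};
texDesc.filterMode = cudaFilterModePoint;   // raw texels; interpolation happens in software
texDesc.readMode = cudaReadModeElementType;
texDesc.normalizedCoords = 0;
texDesc.addressMode[0] = cudaAddressModeClamp;
texDesc.addressMode[1] = cudaAddressModeClamp;
texDesc.addressMode[2] = cudaAddressModeClamp;

cudaTextureObject_t texObj;                 // ends up in grid.texArrayPt[type]
cudaCreateTextureObject(&texObj, &resDesc, &texDesc, NULL);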
I’ve tried to do it using fma (https://devblogs.nvidia.com/lerp-faster-cuda/), which actually turns out to be a little bit slower than the conventional lerp.
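For reference, the fma-based lerp from that post looks like this:

__device__ __forceinline__ float lerp_fma(float v0, float v1, float t) {
    // v0 + t*(v1 - v0), evaluated with two fused multiply-adds
    return fma(t, v1, fma(-t, v0, v0));
}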
This is my current implementation:
(the filter mode of grid.texArrayPt is cudaFilterModePoint)
template<typename T>
__host__ __device__ __forceinline__ T lerp(T v0, T v1, T t) {
    // Returns v0 at t = 0 and v1 at t = 1.
    return (1 - t) * v0 + t * v1;
}
__host__ __device__ __forceinline__ float4 lerp4f(float4 v0, float4 v1, float t) {
    return make_float4(lerp(v0.x, v1.x, t),
                       lerp(v0.y, v1.y, t),
                       lerp(v0.z, v1.z, t),
                       lerp(v0.w, v1.w, t));
}
template<typename REAL>
__host__ __device__ __forceinline__ float4 interpolate(const d_IntrlpGrid<REAL>& grid, unsigned const type,
        REAL x, REAL y, REAL z, unsigned const i)
{
    // Transform world coordinates into (fractional) voxel indices.
    x = (x - grid.minDim.x) * grid.dVox_inv;
    y = (y - grid.minDim.y) * grid.dVox_inv;
    z = (z - grid.minDim.z) * grid.dVox_inv;

    unsigned const idxX = (unsigned) floor(x);
    unsigned const idxY = (unsigned) floor(y);
    unsigned const idxZ = (unsigned) floor(z);

    // Fractional position within the voxel, in [0, 1).
    REAL const a = x - (REAL) idxX;
    REAL const b = y - (REAL) idxY;
    REAL const c = z - (REAL) idxZ;

    // Fetch the eight corners of the enclosing voxel (point-filtered texture reads).
    float4 data[2][2][2];
    data[0][0][0] = tex3D<float4>(grid.texArrayPt[type], idxX,     idxY,     idxZ    );
    data[0][0][1] = tex3D<float4>(grid.texArrayPt[type], idxX,     idxY,     idxZ + 1);
    data[0][1][1] = tex3D<float4>(grid.texArrayPt[type], idxX,     idxY + 1, idxZ + 1);
    data[0][1][0] = tex3D<float4>(grid.texArrayPt[type], idxX,     idxY + 1, idxZ    );
    data[1][1][0] = tex3D<float4>(grid.texArrayPt[type], idxX + 1, idxY + 1, idxZ    );
    data[1][1][1] = tex3D<float4>(grid.texArrayPt[type], idxX + 1, idxY + 1, idxZ + 1);
    data[1][0][1] = tex3D<float4>(grid.texArrayPt[type], idxX + 1, idxY,     idxZ + 1);
    data[1][0][0] = tex3D<float4>(grid.texArrayPt[type], idxX + 1, idxY,     idxZ    );

    // Collapse z, then y, then x.
    float4 result = lerp4f(
            lerp4f(
                    lerp4f(data[0][0][0], data[0][0][1], c),
                    lerp4f(data[0][1][0], data[0][1][1], c),
                    b),
            lerp4f(
                    lerp4f(data[1][0][0], data[1][0][1], c),
                    lerp4f(data[1][1][0], data[1][1][1], c),
                    b),
            a);
    return result;
}
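For completeness, interpolate() is called once per atom from the scoring kernel, roughly like this (a simplified sketch; the kernel signature and names are illustrative, not the exact code from the repository):

template<typename REAL>
__global__ void scoreAtoms(d_IntrlpGrid<REAL> grid, REAL const* x, REAL const* y, REAL const* z,
        unsigned const* types, float4* out, unsigned numAtoms)
{
    unsigned const i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numAtoms) {
        // One trilinear interpolation (8 point-filtered fetches + 7 lerps) per atom.
        out[i] = interpolate(grid, types[i], x[i], y[i], z[i], i);
    }
}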
I’d be really grateful for any suggestions, existing implementations, or tips. :) Thanks in advance and all the best,
Glenn