Performance degradation in highly templated code

Hi,
I’m working on a few features for hoomd (gritty details: https://github.com/MartinGirard/hoomd-blue/blob/ExtendedAlchemicalCalculations/hoomd/md/PotentialPairGPU.cuh). The simplest implementation requires a refactoring of cuda kernels, with an extra layer of template. The current implementation in hoomd makes use of a functor, so that the kernel looks like:

template<class F>
__global__ void kernel(args ...) {
    F eval;
    eval.evaluate(args);
}

For my purposes, I need to refactor this into what is essentially:

template<class F, class G>
__global__ void kernel(args ...) {
    G eval;
    eval.template evaluate<F>(args);
}

struct G {
    template<class F>
    __device__ void evaluate(args) {
        F eval;
        eval.evaluate(args);
    }
};

There is one particular functor for which this refactoring degraded performance by about 30%. Its internals are essentially:

__device__ inline double mypow(double a, double b) {
    return exp(b * log(a));
}

struct ProblemFunctor {
    __device__ void evaluate(double r, double p1, double p2) {
        auto var1 = mypow(r, p1);
        auto var2 = mypow(r, p2);
    }
};

After digging through the generated SASS, I eventually found that the power computation in the initial implementation generates 112 instructions, and in the second 175. This can be fixed by instead using:

struct ProblemFunctor {
    __device__ void evaluate(double r, double p1, double p2) {
        auto rlog = log(r);
        auto var1 = exp(rlog * p1);
        auto var2 = exp(rlog * p2);
    }
};

This implies that the compiler (nvcc 11.4) does not apply the same optimizations to the extra-templated version. What would produce this behavior, and can I force the compiler to inline properly here?

Are you compiling with -rdc=true?

When you dig through the SASS, is there evidence in each case that the mypow function is inlined, or is there evidence in each case that there is an explicit mypow function?

Depending on the answers to those questions, you might want to see if the “recently” introduced link time optimization has an effect (or are you already doing that?).

Nothing is handled incorrectly here. The inline specifier is merely a hint to the compiler. From the ISO C++11 standard:

7.1.2 Function specifiers
[…]
A function declaration (8.3.5, 9.3, 11.3) with an inline specifier declares an inline function. The inline specifier indicates to the implementation that inline substitution of the function body at the point of call is to be preferred to the usual function call mechanism. An implementation is not required to perform this inline substitution at the point of call; however, even if this inline substitution is omitted, the other rules for inline functions defined by 7.1.2 shall still be respected.

The CUDA compiler already inlines functions aggressively, but there are presumably code-size limits on inlining, and mypow() may exceed such a limit after inlining of exp() and log(). Even if mypow() actually gets inlined, performing common subexpression elimination between two inlined mypow() instances may not work quite as well as eliminating the common subexpression by hand at a higher level. Only a detailed analysis of the generated machine code (SASS) can tell whether any of these hypotheses apply.

You could try using __forceinline__, which is a compiler-specific extension. I am not sure whether it is supported for class members, or under which conditions it actually forces inlining. Consult the documentation.

I tried enabling LTO, but there was no change in performance. Using __forceinline__ results in an absence of any CALL in the SASS, but somehow still does not result in elimination of the common log subexpression.

I am now terribly confused as to how the extra template layer hinders detection of common subexpressions if everything is inlined, without any function call.

I already speculated about that in the penultimate paragraph of my previous post. The order in which inlining and CSE (common subexpression elimination) happen could make a difference. Your manual application of rlog = log(r); var1 = exp(rlog * p1); var2 = exp(rlog * p2); is basically a very high-level form of CSE that happens early. I don’t think this is how the compiler transforms the code. I would think it first inlines from the bottom up (so it starts by inlining exp and log inside mypow), then applies CSE to the resulting jumble of hundreds of instructions. You may also be running into code-size limitations that inhibit certain optimizations. I have not analyzed this case, so I could be wrong about this.

You could consider filing an enhancement request with NVIDIA via the bug reporting form. I am not a compiler engineer, so I don’t know what is realistically possible here beyond the optimizations the CUDA compiler already applies. You are already using the latest toolchain 12.3 update 1, and are looking at the performance of a release build with full optimization, correct?