Performance degradation in highly templated code

Hi,
I’m working on a few features for hoomd (gritty details: https://github.com/MartinGirard/hoomd-blue/blob/ExtendedAlchemicalCalculations/hoomd/md/PotentialPairGPU.cuh). The simplest implementation requires a refactoring of cuda kernels, with an extra layer of template. The current implementation in hoomd makes use of a functor, so that the kernel looks like:

template<class F>
__global__ void kernel(args ...) {
    F eval;
    eval.evaluate(args);
}

For my purposes, I need to refactor this into what is essentially:

template<class F, class G>
__global__ void kernel(args ...) {
    G eval;
    eval.template evaluate<F>(args);
}

struct G {
    template<class F>
    __device__ void evaluate(args) {
        F eval;
        eval.evaluate(args);
    }
};

There is one particular functor for which this refactoring degraded performance by about 30%. Its internals are essentially:

__device__ inline double mypow(double a, double b) {
    return exp(b * log(a));
}

struct ProblemFunctor {
    __device__ void evaluate(double r, double p1, double p2) {
        auto var1 = mypow(r, p1);
        auto var2 = mypow(r, p2);
    }
};

After digging through the generated SASS, I eventually found that the power computation in the initial implementation generates 112 instructions, and in the second 175. This can be fixed by instead using:

struct ProblemFunctor {
    __device__ void evaluate(double r, double p1, double p2) {
        auto rlog = log(r);
        auto var1 = exp(rlog * p1);
        auto var2 = exp(rlog * p2);
    }
};

This implies that the compiler (nvcc 11.4) does not apply the same optimizations to the extra-templated version. What would produce this behavior, and can I force the compiler to inline properly here?

Are you compiling with -rdc=true?

When you dig through the SASS, is there evidence in each case that the mypow function is inlined, or is there evidence in each case that there is an explicit mypow function?

Depending on the answers to those questions, you might want to see if the “recently” introduced link time optimization has an effect (or are you already doing that?).

Nothing is handled incorrectly here. The inline specifier is merely a hint to the compiler. From the ISO C++11 standard:

7.1.2 Function specifiers
[…]
A function declaration (8.3.5, 9.3, 11.3) with an inline specifier declares an inline function. The inline specifier indicates to the implementation that inline substitution of the function body at the point of call is to be preferred to the usual function call mechanism. An implementation is not required to perform this inline substitution at the point of call; however, even if this inline substitution is omitted, the other rules for inline functions defined by 7.1.2 shall still be respected.

The CUDA compiler already inlines functions aggressively, but there are presumably code-size limits on inlining, and mypow() may exceed such a limit after inlining of exp() and log(). Even if mypow() actually gets inlined, performing common subexpression elimination between two inlined mypow() instances may not work quite as well as eliminating the common subexpression by hand at a higher level. Only a detailed analysis of the generated machine code (SASS) can tell whether any of these hypotheses apply.

You could try using __forceinline__, which is a compiler-specific extension. I am not sure whether it is supported for class members, or under which conditions it actually forces inlining. Consult the documentation.

I tried enabling LTO, but there was no change in performance. Using __forceinline__ results in an absence of any CALL in the SASS, but somehow still does not result in elimination of the common log subexpression.

I am now terribly confused as to how the extra template layer hinders detection of common subexpressions if everything is inlined, without any function call.

I already speculated about that in the penultimate paragraph of my previous post. The order in which inlining and CSE (common subexpression elimination) happen could make a difference. Your manual application of rlog = log(r); var1 = exp(rlog * p1); var2 = exp(rlog * p2); is basically a very high-level form of CSE that happens early. I don’t think this is how the compiler transforms the code. I would think it first inlines from the bottom up (so it starts by inlining exp and log inside mypow), then applies CSE to the resulting jumble of hundreds of instructions. You may also be running into code-size limitations that inhibit certain optimizations. I have not analyzed this case, so I could be wrong about this.

You could consider filing an enhancement request with NVIDIA via the bug reporting form. I am not a compiler engineer, so I don’t know what is realistically possible here beyond the optimizations the CUDA compiler already applies. You are already using the latest toolchain 12.3 update 1, and are looking at the performance of a release build with full optimization, correct?