Benefits of using MACRO vs device inline function

Dear experts,

I’m using the following macro, whose sub-macros expand differently for fp32 and __half2.

#define COST_FUNCTION(q, r1, l, t, d) \
    FMA(FMA(SUB(r1, q), SUB(r1, q), FLOAT2HALF2(0.0f)), FLOAT2HALF2(1.0f), \
        FIND_MIN(l, FIND_MIN(t, d)))

Nsight Compute shows this as the most executed instruction under Source Counters, which makes sense for my program. The kernel seems highly optimized and is compute bound. I’m wondering whether there is any downside to using a macro rather than a device inline function for this. I think a macro is the best solution for my task. Am I wrong?

Thank You.

Define “best”. You also have not shown how the macro is expanded, or in which context it is used.

Speaking in generalities, macros are a legacy mechanism inherited from C, and in C++ there are often more robust ways of accomplishing the purposes of performance and flexibility for which macros were used in C, e.g. inline functions and templated functions.
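One concrete robustness difference is argument evaluation. The sketch below is an illustration, not the original poster's code: it assumes an fp32 expansion in which SUB is plain subtraction, FMA is fmaf, and FIND_MIN is fminf. The names COST_FUNCTION_F32, cost_function_f32, HD, and CountingSource are hypothetical. Because the macro uses r1 and q twice, an argument expression with side effects runs twice, while the inline function evaluates it once.

```cpp
#include <math.h>

// Under nvcc the function is callable from host and device code;
// under a plain host compiler the CUDA qualifiers are simply absent.
#ifdef __CUDACC__
#define HD __host__ __device__ __forceinline__
#else
#define HD inline
#endif

// Assumed fp32 expansion of the thread's macro: SUB is a - b, FMA is fmaf,
// FIND_MIN is fminf, and FLOAT2HALF2 is the identity for fp32.
#define COST_FUNCTION_F32(q, r1, l, t, d) \
    fmaf(fmaf((r1) - (q), (r1) - (q), 0.0f), 1.0f, fminf((l), fminf((t), (d))))

// The same arithmetic as a function: every argument is evaluated exactly once.
HD float cost_function_f32(float q, float r1, float l, float t, float d) {
    const float diff = r1 - q;
    return fmaf(fmaf(diff, diff, 0.0f), 1.0f, fminf(l, fminf(t, d)));
}

// Hypothetical helper that counts how often an argument expression is evaluated.
struct CountingSource {
    int reads = 0;
    float next() { ++reads; return 3.0f; }
};

// The macro pastes its r1 argument into the expansion twice, so next() runs twice.
int reads_via_macro() {
    CountingSource src;
    const float cost = COST_FUNCTION_F32(1.0f, src.next(), 5.0f, 4.0f, 6.0f);
    (void)cost;
    return src.reads;  // 2
}

// The function call evaluates src.next() once before the body runs.
int reads_via_function() {
    CountingSource src;
    const float cost = cost_function_f32(1.0f, src.next(), 5.0f, 4.0f, 6.0f);
    (void)cost;
    return src.reads;  // 1
}
```

With plain variables as arguments, both forms compute the same value, so the difference only matters when an argument is an expression such as an increment or a function call.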

If you are concerned about performance, per your profiling results, look at the computation generated from the macro after expansion: (1) can all or some of this computation be avoided? (2) are there cheaper ways of performing the same computation? Answering these questions requires knowing the context in which this is used.

Thank You very much. (1) and (2) are very good points. I have tried to do that to some extent, based on the generated instructions, by re-formulating my formula to balance the work between the various hardware pipes (finding the right balance between FMAs and ADDs).
FIND_MIN is min(), SUB expands to fp32/fp16 subtraction, and FMA expands to fused multiply-add for fp32 or fp16.
I was thinking in terms of invocation overhead. In C++, I believe inline functions would generate a call stack while macros do not?

An inline function should not generate a call stack (if it is actually inlined).


The whole point of inline functions is to avoid function call overhead. If you are still skeptical regarding potential performance impact, you could always run a quick experiment.

Note that in general, the CUDA toolchain will inline many functions automatically, and if link-time optimization is enabled, may even do so across the boundaries of compilation units.
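As a rough sketch of what an inline-function replacement for the macro might look like, the sub-macros described earlier in the thread (SUB, FMA, FIND_MIN, FLOAT2HALF2) can be modeled as a per-type traits struct, with one templated function standing in for COST_FUNCTION. The names Ops, cost_function, and CUDA_INLINE are hypothetical. The __half2 branch is an assumption about how the fp16 path is implemented; it relies on the cuda_fp16.h intrinsics __hsub2, __hfma2, __hmin2, and __float2half2_rn, and is compiled only under nvcc.

```cpp
#include <math.h>

#ifdef __CUDACC__
#include <cuda_fp16.h>
// Device-only and force-inlined when compiled with nvcc.
#define CUDA_INLINE __device__ __forceinline__
#else
// Plain inline so the fp32 path can also be checked with a host compiler.
#define CUDA_INLINE inline
#endif

// Hypothetical traits struct: one specialization per arithmetic type.
template <typename T>
struct Ops;

template <>
struct Ops<float> {
    static CUDA_INLINE float sub(float a, float b) { return a - b; }
    static CUDA_INLINE float mul_add(float a, float b, float c) { return fmaf(a, b, c); }
    static CUDA_INLINE float minimum(float a, float b) { return fminf(a, b); }
    static CUDA_INLINE float from_float(float x) { return x; }
};

#ifdef __CUDACC__
// Assumed fp16 path: packed half2 intrinsics from cuda_fp16.h.
template <>
struct Ops<__half2> {
    static CUDA_INLINE __half2 sub(__half2 a, __half2 b) { return __hsub2(a, b); }
    static CUDA_INLINE __half2 mul_add(__half2 a, __half2 b, __half2 c) { return __hfma2(a, b, c); }
    static CUDA_INLINE __half2 minimum(__half2 a, __half2 b) { return __hmin2(a, b); }
    static CUDA_INLINE __half2 from_float(float x) { return __float2half2_rn(x); }
};
#endif

// Same structure as the COST_FUNCTION macro, with type-checked arguments
// that are each evaluated once.
template <typename T>
CUDA_INLINE T cost_function(T q, T r1, T l, T t, T d) {
    typedef Ops<T> O;
    const T diff = O::sub(r1, q);
    const T squared = O::mul_add(diff, diff, O::from_float(0.0f));
    return O::mul_add(squared, O::from_float(1.0f), O::minimum(l, O::minimum(t, d)));
}
```

A kernel would call cost_function(q, r1, l, t, d) with either float or __half2 operands. Whether the generated code matches the macro version is best confirmed by comparing the SASS or the profile of both builds, as suggested above.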
