Benefits of using MACRO vs device inline function

harisankarhss · October 27, 2022, 5:50pm

Dear experts,

I’m using the following macro, whose sub-macros expand differently for fp32 and __half2.

#define COST_FUNCTION(q, r1, l, t, d) \
    FMA(FMA(SUB(r1, q), SUB(r1, q), FLOAT2HALF2(0.0f)), FLOAT2HALF2(1.0f), \
        FIND_MIN(l, FIND_MIN(t, d)))

Nsight Compute shows this as the most executed instruction under source counters, which makes sense for my program. The kernel seems highly optimized and is compute bound. I’m wondering if there is any downside to using a macro versus a device inline function for the same purpose. I’m thinking a macro is the best solution for my task. Am I wrong?

Thank You.

njuffa · October 27, 2022, 6:06pm

Define “best”. You also have not shown how the macro is expanded, or in which context it is used.

Speaking in generalities, macros are a legacy mechanism inherited from C, and in C++ there are often more robust ways of accomplishing the purposes of performance and flexibility for which macros were used in C, e.g. inline functions and templated functions.
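As an illustration of the templated inline alternative, here is a minimal sketch for the fp32 case. The names `HOST_DEVICE`, `sub`, `fused_mul_add`, `find_min`, `from_float` and `cost_function` are hypothetical, and the fp32 behaviour of the sub-macros is an assumption, since their expansions are not shown in this thread. A `__half2` version would presumably add matching overloads ahead of the template.

```cpp
#include <math.h>

// Assumption: under nvcc, mark the helpers for host and device use and ask
// for inlining; as plain C++ they are ordinary inline functions.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__ __forceinline__
#else
#define HOST_DEVICE inline
#endif

// fp32 overloads of the primitive operations (hypothetical names).
HOST_DEVICE float sub(float a, float b) { return a - b; }
HOST_DEVICE float fused_mul_add(float a, float b, float c) { return fmaf(a, b, c); }
HOST_DEVICE float find_min(float a, float b) { return fminf(a, b); }

// Type-dependent constant conversion, standing in for FLOAT2HALF2.
template <typename T> HOST_DEVICE T from_float(float x);
template <> HOST_DEVICE float from_float<float>(float x) { return x; }

// Same arithmetic as the COST_FUNCTION macro, with type checking and
// arguments that are evaluated exactly once.
template <typename T>
HOST_DEVICE T cost_function(T q, T r1, T l, T t, T d) {
    const T diff = sub(r1, q);  // written once; the macro spells SUB(r1, q) twice
    const T sq = fused_mul_add(diff, diff, from_float<T>(0.0f));
    return fused_mul_add(sq, from_float<T>(1.0f), find_min(l, find_min(t, d)));
}
```

Calling `cost_function(q, r1, l, t, d)` with `float` arguments selects the fp32 path through ordinary overload resolution, so no preprocessor switch is needed at the call site.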

If you are concerned about performance, per your profiling results, look at the computation generated from the macro after expansion: (1) can all or some of this computation be avoided? (2) are there cheaper ways of performing the same computation? Answering these questions requires knowing the context in which this is used.

harisankarhss · October 27, 2022, 6:19pm

Thank you very much. (1) and (2) are very good points. I have tried to do that to some extent, guided by the instruction mix, by reformulating my formula to balance the load across the various hardware pipes (finding the right balance between FMAs and ADDs).
FIND_MIN is min(), SUB expands to fp32/fp16 subtraction, and FMA expands to a fused multiply-add for fp32 or fp16.
I was thinking in terms of invocation overhead. From my C++ experience, I believe inline functions would generate a call stack while macros do not?
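For readers following along, the fp32 expansion described above might look roughly like the sketch below. These definitions are an assumption, not the poster's actual code. An fp16 build would presumably map the same names to `cuda_fp16.h` intrinsics such as `__hsub2`, `__hfma2`, `__hmin2` and `__float2half2_rn`.

```cpp
#include <math.h>

// Hypothetical fp32 expansions; the real ones are not shown in the thread.
#define SUB(a, b)       ((a) - (b))
#define FMA(a, b, c)    fmaf((a), (b), (c))
#define FIND_MIN(a, b)  fminf((a), (b))
#define FLOAT2HALF2(x)  (x)  // identity when building for fp32

#define COST_FUNCTION(q, r1, l, t, d)                                   \
    FMA(FMA(SUB(r1, q), SUB(r1, q), FLOAT2HALF2(0.0f)),                \
        FLOAT2HALF2(1.0f), FIND_MIN(l, FIND_MIN(t, d)))

// Small wrapper so the macro can be exercised like a function.
inline float cost_via_macro(float q, float r1, float l, float t, float d) {
    return COST_FUNCTION(q, r1, l, t, d);
}
```

With these definitions the macro computes (r1 - q)^2 + min(l, t, d), so for example `cost_via_macro(1.0f, 3.0f, 5.0f, 2.0f, 7.0f)` gives 4 + 2 = 6.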

Robert_Crovella · October 27, 2022, 6:37pm

An inline function should not generate a call stack (if it is actually inlined).

njuffa · October 27, 2022, 6:38pm

The whole point of inline functions is to avoid function call overhead. If you are still skeptical regarding potential performance impact, you could always run a quick experiment.
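A possible starting point for such an experiment is sketched below. It puts an fp32-only macro next to an equivalent inline function and first checks that they agree numerically. The names `COST_MACRO`, `cost_inline`, `count_mismatches` and `HOST_DEVICE` are hypothetical, and the fp32 arithmetic is an assumption about the poster's expansion. The performance comparison itself would have to be done on the GPU, for instance by building a kernel around each variant with nvcc and comparing the SASS from `cuobjdump -sass` or the timings in Nsight Compute.

```cpp
#include <math.h>

// Assumption: same source builds as plain C++ or, under nvcc, as device code.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__ __forceinline__
#else
#define HOST_DEVICE inline
#endif

// Variant 1: macro, written out for fp32 only (assumed expansion).
#define COST_MACRO(q, r1, l, t, d) \
    fmaf(fmaf((r1) - (q), (r1) - (q), 0.0f), 1.0f, fminf((l), fminf((t), (d))))

// Variant 2: the same arithmetic as an inline function.
HOST_DEVICE float cost_inline(float q, float r1, float l, float t, float d) {
    const float diff = r1 - q;
    return fmaf(fmaf(diff, diff, 0.0f), 1.0f, fminf(l, fminf(t, d)));
}

// Returns the number of test inputs (out of n) on which the variants differ.
HOST_DEVICE int count_mismatches(int n) {
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        const float q = 0.25f * i, r1 = 0.5f * i + 1.0f;
        const float l = 3.0f + i, t = 2.0f + i, d = 4.0f + i;
        if (COST_MACRO(q, r1, l, t, d) != cost_inline(q, r1, l, t, d)) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

If `count_mismatches` reports zero and the generated machine code for the two kernels is the same, the choice between the two forms comes down to readability and type safety rather than speed.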

Note that in general, the CUDA toolchain will inline many functions automatically, and if link-time optimization is enabled, may even do so across the boundaries of compilation units.
