Good programming practice for inlining a device function

If the compiler fails to inline a device function, what I usually do is to 1) put the function definition in a header file, 2) put the inline keyword in front of the function, and 3) include the header file in the same compilation unit as the kernel function. This works well when the function is not complicated. However, when the device function is long or complicated, the header file gets ugly. I am wondering if there is a better way of inlining long/complicated device functions.
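For reference, the pattern described above might look like the following sketch. The file and function names are hypothetical; the point is that the definition is visible, with the inline qualifier, in the same compilation unit as the kernel.

```cuda
// helpers.cuh -- hypothetical header included by the kernel's translation unit
#pragma once

// 'inline' plus a definition visible in the kernel's translation unit
// gives the compiler everything it needs to inline the call.
inline __device__ float saxpy_elem(float a, float x, float y)
{
    return a * x + y;
}
```

```cuda
// kernel.cu -- same compilation unit sees the full definition
#include "helpers.cuh"

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = saxpy_elem(a, x[i], y[i]);  // candidate for inlining
}
```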

__forceinline__ should be honoured:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#noinline-and-forceinline

What benefit do you anticipate (or realize) from inlining “complicated/long device functions”? Do you use separate compilation, and if so, do you use link-time optimization?

To first order, inlining a function and moving it to a header file seem like orthogonal issues to me. What is the connection between the two in your scenario?

If I want to inline a device function that the compiler refuses to inline, I add the __forceinline__ attribute. I have not used that in years as the compiler does a good job of identifying functions that should be inlined for performance gain.
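A minimal sketch of that attribute in use (the helper function is hypothetical):

```cuda
// __forceinline__ asks nvcc to inline even where its heuristics would
// decline; its counterpart __noinline__ forces the opposite.
__forceinline__ __device__ float clamp01(float v)
{
    return v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v);
}
```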

For the application code I have been working on recently, I've found that inlining these complicated functions reduces register usage and thus improves performance significantly. I did use separate compilation. What link-time optimizations are you referring to for nvcc? I know the Intel compiler has the -ipo option to help inlining across compilation units. Does nvcc have something similar to -ipo?

What I should really ask is: does nvcc provide an option for enabling inlining across compilation units?

Yes, it does (although arguably not quite as sophisticated yet given that Intel has polished their IPO functionality for the past 20 years or so). For an introduction, see:

Yes, see the blog entry linked above:

In device LTO mode, we store a high-level intermediate form of the code for each translation unit, and then at link time we merge all those intermediates to create a high-level representation of all the device code. This enables the linker to perform high-level optimizations like inlining across file boundaries, which not only eliminates the overhead of the calling conventions, but also further enables other optimizations on the inlined block of code itself.
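For illustration, a separate-compilation build with device LTO might look like this. The file names and target architecture are placeholders; the -dc and -dlto flags are the ones the blog post describes.

```shell
# Compile each unit to relocatable device code, storing LTO-ready IR
nvcc -arch=sm_70 -dc -dlto a.cu -o a.o
nvcc -arch=sm_70 -dc -dlto b.cu -o b.o

# Link step: the stored intermediates are merged here, enabling
# cross-file inlining of device functions
nvcc -arch=sm_70 -dlto a.o b.o -o app
```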

Many thanks! This is exactly what I am looking for.

Btw, I checked CUDA 11.0, and the -dlto option is not in the nvcc documentation yet. The linked article says device LTO was provided as a preview in 11.0.

Note that the blog post refers to CUDA 11.2. The current CUDA version is 11.5. You may want to upgrade to that.
