C++ class member __device__ functions: where should __forceinline__ go?

I am working on an existing CUDA project that uses C++ classes for everything and has hundreds of __device__ member functions in those classes.

I usually use helper functions to break any C++ class down into primitive (or aligned CUDA) types so that kernels can operate on primitives, but in this case I must preserve 'readability' and cannot make such changes.

So for a given class that will exist on both the host and the device, and has literally hundreds of simple __device__ member functions, where is the correct place to put __forceinline__?

In other words, can I leave the class declaration as just

class Bar {
    __device__ void foo();
};

in the .h file, then in the .cu file qualify in the definition like

__device__ __forceinline__ void Bar::foo() { ... }


The above compiled. I also tried qualifying the same way in the .h file, and that seemed OK as well, but there was a pre-existing code comment warning against the inline qualifier in the class declarations.

What would be the prudent way of guaranteeing all these damn class member functions actually end up inlined in device code? They are all small and usually called in a nested fashion.

Note: the class member functions I am talking about are strictly __device__ functions and are never called from the host.
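One possible arrangement, sketched under assumptions rather than taken from the project in question: keep the declarations inside the class, but put the definitions in the same header, below the class, with the qualifier on both. The reasoning is that, without link-time optimization, a compiler can only inline a call when the function body is visible in the translation unit making the call, so a definition that lives only in one .cu file cannot be inlined into callers in other files. All names here (BAR_DEVICE_INLINE, scale, scaleAndShift, gain, offset, applyBar) are hypothetical, and the macro fallback to plain inline exists only so the header also compiles with an ordinary C++ compiler.

```cpp
// ---- bar.h (hypothetical file name) ----
#ifdef __CUDACC__
// Built by nvcc: device-only functions with forced inlining.
#define BAR_DEVICE_INLINE __device__ __forceinline__
#else
// Built by a plain C++ compiler: fall back to ordinary inline functions.
#define BAR_DEVICE_INLINE inline
#endif

class Bar {
public:
    float gain;
    float offset;

    // Declarations only, so the class body stays short and readable.
    BAR_DEVICE_INLINE float scale(float x) const;
    BAR_DEVICE_INLINE float scaleAndShift(float x) const;
};

// Definitions stay in the header, so every caller sees the bodies.
BAR_DEVICE_INLINE float Bar::scale(float x) const { return gain * x; }

// Nested call: scale() is a candidate for inlining here as well.
BAR_DEVICE_INLINE float Bar::scaleAndShift(float x) const {
    return scale(x) + offset;
}

// ---- some .cu file that includes bar.h ----
#ifdef __CUDACC__
// Example kernel (only compiled by nvcc) that calls the member function.
__global__ void applyBar(Bar b, const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = b.scaleAndShift(in[i]);
}
#endif
```

Routing the qualifier through a single macro means hundreds of member functions can be switched between forced inlining and the compiler's default by editing one line.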

Don't most compilers inline automatically now anyway? Or is the CUDA compiler too immature for that?

I do not know. I always assumed that the compiler would inline device functions, but I have noticed cases where device functions called multiple times in a loop (or in a chain of nested device functions) benefited from being forced inline.
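A rough way to measure whether forcing matters for a given function is to make the qualifier switchable at compile time and compare the two builds. The sketch below assumes a hypothetical COMPARE_NOINLINE define and a hypothetical Vec2 type; __noinline__ asks the compiler not to inline, while __forceinline__ asks it to always inline.

```cpp
#if !defined(__CUDACC__)
// Plain C++ build: ordinary inline functions.
#define DEV_FN inline
#elif defined(COMPARE_NOINLINE)
// nvcc build with -DCOMPARE_NOINLINE: request real calls for comparison.
#define DEV_FN __device__ __noinline__
#else
// Default nvcc build: request forced inlining.
#define DEV_FN __device__ __forceinline__
#endif

struct Vec2 {
    float x, y;

    // Small leaf function.
    DEV_FN float dot(const Vec2& o) const { return x * o.x + y * o.y; }

    // Nested call into dot(), similar to the nested pattern described above.
    DEV_FN float norm2() const { return dot(*this); }
};
```

Building once with and once without -DCOMPARE_NOINLINE and timing the kernels gives a direct comparison. The PTX emitted by nvcc -ptx can also be inspected: functions that were not inlined usually show up there as .func entries reached through call instructions.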

I am more of a C guy than C++, which is why I ask.