I am working on an existing CUDA project that uses C++ classes for everything, and those classes contain hundreds of __device__ member functions.
I usually use helper functions to break any C++ class down into primitive (or aligned CUDA) types so that kernels can operate on primitives, but in this case I must preserve ‘readability’ and cannot make such changes.
So for a given class that will exist on both the host and the device, and that has literally hundreds of simple __device__ member functions, where is the correct place to put __forceinline__?
In other words, can I leave the class declaration as just
class Bar{
public:
__device__ void foo();
};
in the .h file, and then add the qualifier to the definition in the .cu file, like
__device__ __forceinline__ void Bar::foo(){..}
?
The above compiled, and I also tried qualifying the same way in the .h file, which seemed OK as well, but there was a pre-existing code comment warning against using the inline qualifier in the class declarations.
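For concreteness, here is a minimal sketch of the header variant: the in-class declarations stay untouched, and the qualified definitions sit in the same .h file after the class. It assumes every caller includes this header so that the definitions are visible where they are called. The `value()` member and the `x_` field are hypothetical, added only so the sketch is self-contained, and the macro fallback at the top is an assumption that lets the snippet build with a plain host compiler as well as with nvcc.

```cpp
// bar.h (illustrative sketch only, not the project's actual code)

// Assumption: when not compiling with nvcc, map the CUDA qualifiers to
// host equivalents so the sketch can also be built as ordinary C++.
#ifndef __CUDACC__
#define __device__
#define __forceinline__ inline
#endif

class Bar {
public:
    __device__ void foo();         // declaration left unqualified, as in the question
    __device__ int value() const;  // hypothetical accessor, for illustration only
private:
    int x_ = 0;                    // hypothetical state
};

// Definitions in the same header, with __forceinline__ on the definition only.
__device__ __forceinline__ void Bar::foo() { x_ += 1; }
__device__ __forceinline__ int Bar::value() const { return x_; }
```

Whether the compiler actually inlined the calls would still need to be checked in the generated code; the qualifier placement shown here is only one of the two variants described above.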
What would be the prudent way of guaranteeing all these damn class member functions actually end up inlined in device code? They are all small and usually called in a nested fashion.
Note: the member functions I am talking about are strictly device functions and are never called from the host.