This is not an issue specific to the CUDA toolchain. For the compiler to inline a function, the definition of that function needs to be available. If the function is in a different compilation unit, only its interface is visible during compilation, not its implementation.
Inlining across different compilation units is something only compilers with an optimizing linker stage support. That is, the linker provides for whole program optimization, usually by working with an intermediate code representation stored in a database that comprises all object modules that are part of the executable. The CUDA toolchain’s device-code linker does not support such functionality at present.
Without an optimizing linker, you need to make the definition of such functions visible during the compilation stage. Stick your static __forceinline__ functions into a file that is #included by all relevant compilation units (of which you can have as many as you want for your modular design).
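As a rough sketch of that layout, here is a small example. The file names (inline_helpers.cuh, axpy_kernel.cu, main.cu) and the functions axpy, axpy_kernel and run_demo are made up for illustration. The three "files" are shown back to back in one listing so it compiles as a single .cu file; in a real project each .cu file would #include the header instead.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// ===== inline_helpers.cuh (hypothetical shared header) =====
#ifndef INLINE_HELPERS_CUH
#define INLINE_HELPERS_CUH

// 'static' gives every including compilation unit its own private copy,
// so the repeated definitions do not collide at link time.
static __forceinline__ __device__ float axpy(float a, float x, float y)
{
    return a * x + y;
}

#endif // INLINE_HELPERS_CUH

// ===== axpy_kernel.cu (hypothetical compilation unit) =====
// In a real project this file would begin with: #include "inline_helpers.cuh"
__global__ void axpy_kernel(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // The body of axpy() is visible here, so the compiler can inline it.
        y[i] = axpy(a, x[i], y[i]);
    }
}

// ===== main.cu (hypothetical host-side driver) =====
// Runs the kernel on a tiny array and returns y[0], or -1.0f on a CUDA error.
float run_demo()
{
    const int n = 4;
    const size_t bytes = n * sizeof(float);
    float hx[n] = {1.0f, 2.0f, 3.0f, 4.0f};
    float hy[n] = {10.0f, 20.0f, 30.0f, 40.0f};
    float *dx = nullptr;
    float *dy = nullptr;

    if (cudaMalloc(&dx, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&dy, bytes) != cudaSuccess) { cudaFree(dx); return -1.0f; }

    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    axpy_kernel<<<1, n>>>(n, 2.0f, dx, dy);

    // This blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    return (err == cudaSuccess) ? hy[0] : -1.0f;
}

int main()
{
    printf("y[0] = %f\n", run_demo());
    return 0;
}
```

Any number of additional .cu files can include the same header this way. Each one compiles its own inlined copy of axpy, so no device-code linking is needed for that function.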
You can submit feature requests to NVIDIA by filing a bug report and prefixing the synopsis with “RFE:” to mark it as an enhancement request rather than a report for a functional bug.