NVCC can't inline device code across compilation units - workarounds? feature request?

What could be the reason that nvcc can’t inline device functions across compilation units? I’ve seen this have a dramatic effect on speed, at least a factor of two. It makes it hard to organize/factor large programs in a traditional manner - one is forced to compile all kernels and device functions into a single unit.

How do people deal with this? Or, is this a planned feature for future releases of CUDA? If not, how would one request it?

This is not an issue specific to the CUDA toolchain. In order to inline a function, the compiler needs to see its definition. If the function is in a different compilation unit, only its interface is visible during compilation, not its implementation.

Inlining across different compilation units is something only compilers with an optimizing linker stage support. That is, the linker provides for whole program optimization, usually by working with an intermediate code representation stored in a database that comprises all object modules that are part of the executable. The CUDA toolchain’s device-code linker does not support such functionality at present.

Without an optimizing linker, you need to make the definition of such functions visible during the compilation stage. Stick your static __forceinline__ functions into a file that is #included by all relevant compilation units (of which you can have as many as you want for your modular design).
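As a rough sketch of that layout: the file name device_math.cuh, the DEVICE_INLINE macro, the function weighted_mix, and the kernel blend_kernel are all made up for illustration. The plain C++ fallback and the added __host__ qualifier are assumptions of this sketch, there only so the helper can also be built and checked with a host compiler; they are not required by the approach.

```cpp
// device_math.cuh (hypothetical shared header, #included by every .cu file that needs it)
#ifndef DEVICE_MATH_CUH
#define DEVICE_MATH_CUH

// Assumption of this sketch: when not compiling with nvcc, fall back to plain
// C++ so the same header can be exercised on the host.
#ifdef __CUDACC__
#define DEVICE_INLINE static __forceinline__ __host__ __device__
#else
#define DEVICE_INLINE static inline
#endif

// The full definition (not just a declaration) lives in the header, so every
// compilation unit that includes it can inline the call.
DEVICE_INLINE float weighted_mix(float a, float b, float t)
{
    return a + t * (b - a);
}

#ifdef __CUDACC__
// A kernel like this would normally sit in its own file, e.g. a hypothetical
// kernel_a.cu, after '#include "device_math.cuh"'.
__global__ void blend_kernel(const float *x, const float *y, float *out,
                             float t, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = weighted_mix(x[i], y[i], t);  // definition is visible here
    }
}
#endif

#endif // DEVICE_MATH_CUH
```

Because the helper is declared static, each compilation unit gets its own private copy, so including the header from several .cu files should not cause duplicate-symbol errors at link time.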

You can submit feature requests to NVIDIA by filing a bug report and prefixing the synopsis with “RFE:” to mark it as an enhancement request rather than a report for a functional bug.