How __device__ functions get allocated

Sorry for such a stupid question, but I have never called device functions before.
I searched through the CUDA documentation but was unable to find out how resources are actually allocated for device functions.
Suppose I run my program on a GPU with 27 multiprocessors, with only 16 of them allocated to kernels. I’m calling a device function from inside every kernel. So how does CUDA allocate device resources to these device calls (in terms of multiprocessors)?

It doesn’t. All device functions are in-lined by the compiler.

So what you are saying is that all these device functions are just executed as part of the kernel that called them, that is, on the same multiprocessor they were called from? Much like an #include directive tells the compiler to replace the directive with the contents of the included file. Is that right?

It doesn’t exactly work that way, but that is the net effect, yes.
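
To make that concrete, here is a minimal sketch (all names, such as scale_and_shift and run_apply, are made up for illustration and do not come from this thread). It assumes a toolchain and GPU recent enough to support the __noinline__ qualifier, which asks the compiler to emit a real call instead of inlining. Either way, the device function body runs in the calling thread, on whichever multiprocessor that thread's block was scheduled on, so nothing extra is allocated for it.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Hypothetical helper: the compiler is free to inline this into the kernel.
__device__ float scale_and_shift(float x, float a, float b)
{
    return a * x + b;
}

// Same helper, but with a hint not to inline it. It is still executed by the
// calling thread; it does not get a multiprocessor of its own.
__device__ __noinline__ float scale_and_shift_call(float x, float a, float b)
{
    return a * x + b;
}

// One thread per element; each thread calls the device function itself.
__global__ void apply(const float* in, float* out, int n, bool use_call)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = use_call ? scale_and_shift_call(in[i], 2.0f, 1.0f)
                          : scale_and_shift(in[i], 2.0f, 1.0f);
    }
}

// Abort with a message if a CUDA runtime call failed.
static void check(cudaError_t err)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        std::exit(1);
    }
}

// Host-side driver: copies the input to the GPU, launches the kernel,
// and returns 2 * x + 1 for every input element x.
std::vector<float> run_apply(const std::vector<float>& host_in, bool use_call)
{
    const int n = static_cast<int>(host_in.size());
    std::vector<float> host_out(n, 0.0f);
    if (n == 0) return host_out;

    const size_t bytes = n * sizeof(float);
    float* dev_in = nullptr;
    float* dev_out = nullptr;
    check(cudaMalloc(&dev_in, bytes));
    check(cudaMalloc(&dev_out, bytes));
    check(cudaMemcpy(dev_in, host_in.data(), bytes, cudaMemcpyHostToDevice));

    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    apply<<<blocks, threads>>>(dev_in, dev_out, n, use_call);
    check(cudaGetLastError());

    // This blocking copy also waits for the kernel to finish.
    check(cudaMemcpy(host_out.data(), dev_out, bytes, cudaMemcpyDeviceToHost));
    check(cudaFree(dev_in));
    check(cudaFree(dev_out));
    return host_out;
}
```

Calling run_apply from your own main with use_call set to false and then to true should give identical results. The only thing that changes is whether the generated code contains a call instruction, not where the work is executed.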