How are functions compiled? Are function calls expanded inline, or are they actually CALLED?


As far as I know, the current limit on the size of a CUDA kernel is 2 million PTX instructions, which works out to roughly 2 MB.

I am trying to create an application that would do a lot of work on the GPU, so the kernel would be quite large. I intend to use functions to distribute the work (just like anyone else on this planet). I read somewhere that function calls are expanded inline. Is that true? If so, will all my device functions become part of one huge piece of code and result in a massive kernel (or possibly an error!)? Or do device function calls actually follow the regular stack push/pop on the GPU, as they do on the CPU?


On compute capability 1.x, functions are always inlined because no call stack exists. On compute capability 2.x, inlining is by default the compiler's decision. You can declare a function as __noinline__ to prevent inlining. This can prevent code bloat if a device function is used multiple times in a kernel and the compiler would otherwise decide to inline it anyway (e.g. because inlining would open up additional optimization opportunities).

Just for completeness, there is also __forceinline__ as a counterpart to __noinline__. Since code size is the concern here, __noinline__ is probably the function qualifier of more interest in this case.
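To make the two qualifiers concrete, here is a minimal sketch (function names are made up for illustration) of how they appear on device functions. On compute capability 2.x, __noinline__ asks nvcc to emit the function as a real call so its body exists once, instead of being duplicated at every call site; __forceinline__ requests the opposite:

```cuda
// Large body: request a real function call to avoid duplicating
// this code at every call site (honored on compute capability 2.x+).
__noinline__ __device__ float heavy_work(float x)
{
    // stand-in for a big routine that would bloat the kernel if inlined
    return x * x + 1.0f;
}

// Tiny helper: inlining it is cheap, so request inlining.
__forceinline__ __device__ float tiny_helper(float x)
{
    return 2.0f * x;
}

__global__ void kernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = heavy_work(tiny_helper(in[i]));  // heavy_work stays a real call
}
```

If a kernel calls heavy_work from many places, keeping it out-of-line keeps only one copy of its code in the kernel image, which is exactly the code-size concern raised in the question.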

@tera: I have a GTX 470, which is compute capability 2.0. *relaxes* Thanks for the clarification.

@njuffa: Thanks for that valuable complement to the previous answer.