As far as I know, the current limit on the size of CUDA kernels is 2 million PTX instructions, which works out to roughly 2 MB.
I am trying to create an application that does a lot of work on the GPU, so the kernel will be quite large. I intend to use functions to distribute the work (just like anyone else on this planet would). I read somewhere that device function calls are expanded inline. Is that true? If so, will all my device functions become part of one huge piece of code, resulting in a massive kernel (or possibly a compile error)? Or do device function calls actually follow the regular stack push/pop on the GPU, as they do on the CPU?
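For context, CUDA does expose qualifiers to influence this: `__noinline__` asks the compiler to keep a device function as a real call, and `__forceinline__` requests inlining. A minimal sketch (function names are made up for illustration):

```cuda
// __noinline__ asks nvcc to emit a true call (with a stack frame)
// instead of inlining the body into every caller. Supported as an
// actual call on devices of compute capability 2.x and later; on
// earlier hardware the compiler may inline regardless.
__device__ __noinline__ int heavy_helper(int x)
{
    return x * x + 1;
}

// __forceinline__ explicitly requests inlining for a small function.
__device__ __forceinline__ int small_helper(int x)
{
    return x + 1;
}

__global__ void work_kernel(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = heavy_helper(small_helper(i));
}
```

Whether a call is inlined by default depends on the compiler's heuristics and the target architecture, so checking the generated PTX (`nvcc -ptx`) is the reliable way to see what actually happened.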