Speed of device vs. global functions

Hello! I am writing a program that does heavy computation in several global functions, which are called in a loop. Profiling with nvprof shows that launching the global functions takes much more time than the computations themselves. So my question is: would it be better to have a single global function that calls several device functions, instead of calling several global functions from the host?

This is the multi vs mega kernel dilemma…
Up to a point, calling device functions from one global function will be faster than launching several global functions.
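To make the two options concrete, here is a minimal sketch. The stages `stage_a` and `stage_b` and the kernel names are hypothetical stand-ins for the real computations, and both stages are assumed to be purely per-element:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical per-element stages, standing in for the real computations.
__device__ float stage_a(float x) { return x * 2.0f; }
__device__ float stage_b(float x) { return x + 1.0f; }

// Multi-kernel variant: one __global__ function per stage, one launch each.
__global__ void kernel_a(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = stage_a(d[i]);
}
__global__ void kernel_b(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = stage_b(d[i]);
}

// Mega-kernel variant: a single launch that runs both stages as __device__ calls.
__global__ void fused(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = stage_b(stage_a(d[i]));
}

int main() {
    const int n = 1 << 20, threads = 256;
    const int blocks = (n + threads - 1) / threads;
    float *d_multi, *d_fused;
    cudaMalloc(&d_multi, n * sizeof(float));
    cudaMalloc(&d_fused, n * sizeof(float));
    cudaMemset(d_multi, 0, n * sizeof(float));  // all-zero bytes are 0.0f
    cudaMemset(d_fused, 0, n * sizeof(float));

    // Two launches versus one launch for the same work.
    kernel_a<<<blocks, threads>>>(d_multi, n);
    kernel_b<<<blocks, threads>>>(d_multi, n);
    fused<<<blocks, threads>>>(d_fused, n);
    cudaDeviceSynchronize();

    // Compare the first element of each result; they should agree.
    float h_multi = 0.0f, h_fused = 0.0f;
    cudaMemcpy(&h_multi, d_multi, sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_fused, d_fused, sizeof(float), cudaMemcpyDeviceToHost);
    printf("multi: %f  fused: %f\n", h_multi, h_fused);

    cudaFree(d_multi);
    cudaFree(d_fused);
    return 0;
}
```

Fusing like this is only safe when a stage does not need results that other blocks produced in the previous stage. A kernel boundary acts as a grid-wide synchronization point, and a fused kernel loses that.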
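After that you might start to suffer from register pressure: the compiler runs out of registers and spills values to local memory, which physically lives in device (global) memory (compiling with nvcc -Xptxas -v will report register usage and spills, so it tells you when this has happened).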
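Register use and local memory use can also be queried from the host at run time with `cudaFuncGetAttributes`. A small sketch, assuming a hypothetical kernel called `fused`; a nonzero `localSizeBytes` suggests that the kernel spills registers or keeps per-thread arrays in local memory:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical fused kernel; substitute the real one.
__global__ void fused(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;
}

int main() {
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, fused);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // Registers used by each thread of this kernel.
    printf("registers per thread:    %d\n", attr.numRegs);
    // Local memory per thread, in bytes (register spills end up here).
    printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    return 0;
}
```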
Even after that, falling back to memory instead of registers might still be faster than making separate global function calls. In fact, at least from what I've seen, it is not so much that global function calls are expensive. It is what the compiler can do when it sees more of the code in one place, because the device functions can be inlined and so on.
However, testing to see what works best is always the answer (because, as usual, “it depends”).
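One way to run such a test is to time both variants with CUDA events. This is only a sketch: the stages, kernel names, problem size, and iteration count are all made up for illustration and should be replaced with the real workload:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical per-element stages (chosen so values stay bounded over many iterations).
__device__ float stage_a(float x) { return x * 0.5f; }
__device__ float stage_b(float x) { return x + 1.0f; }

__global__ void kernel_a(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = stage_a(d[i]);
}
__global__ void kernel_b(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = stage_b(d[i]);
}
__global__ void fused(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = stage_b(stage_a(d[i]));
}

int main() {
    const int n = 1 << 20, threads = 256, iters = 1000;
    const int blocks = (n + threads - 1) / threads;
    float* d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms_multi = 0.0f, ms_fused = 0.0f;

    // Warm-up launch so one-time initialization is not included in the timings.
    fused<<<blocks, threads>>>(d, n);
    cudaDeviceSynchronize();

    // Multi-kernel variant: two launches per iteration.
    cudaEventRecord(start);
    for (int it = 0; it < iters; ++it) {
        kernel_a<<<blocks, threads>>>(d, n);
        kernel_b<<<blocks, threads>>>(d, n);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms_multi, start, stop);

    // Mega-kernel variant: one launch per iteration.
    cudaEventRecord(start);
    for (int it = 0; it < iters; ++it) {
        fused<<<blocks, threads>>>(d, n);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms_fused, start, stop);

    printf("multi-kernel: %.3f ms, fused: %.3f ms (%d iterations)\n",
           ms_multi, ms_fused, iters);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```

The numbers depend heavily on the GPU, the problem size, and how much work each stage does, so it is worth measuring with sizes close to the real ones.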