Can too many kernel calls affect the performance of an algorithm even though all the variables are stored on the device?
A kernel launch has some overhead, so if you chop your work down into smaller pieces than necessary, it will take a little longer to finish the whole job. Do you have a specific example in mind?
Depends. There is a fixed cost for launching a kernel (which is quite high on Windows Vista/7), so given the choice you should try to combine kernels where possible.
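If you want a number for your own machine and driver, a rough way to get one is to time a long run of empty kernel launches. This is only a sketch: the names `emptyKernel` and `averageLaunchMicroseconds` and the launch count are made up for illustration, error checking is left out, and because the launches are queued back to back it measures the average cost per launch rather than the latency of a single isolated one.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel: it does no work, so almost all of the
// measured time is launch overhead.
__global__ void emptyKernel() {}

// Returns the average time per launch, in microseconds, for `launches`
// empty kernels queued back to back on the default stream.
float averageLaunchMicroseconds(int launches) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time setup costs are not included in the timing.
    emptyKernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < launches; ++i) {
        emptyKernel<<<1, 1>>>();
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until every queued launch has finished

    float elapsedMs = 0.0f;
    cudaEventElapsedTime(&elapsedMs, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    return elapsedMs * 1000.0f / launches;  // milliseconds -> microseconds
}

int main() {
    const int launches = 10000;  // arbitrary count, chosen for illustration
    printf("average cost per launch: %.2f us\n",
           averageLaunchMicroseconds(launches));
    return 0;
}
```

Running the same binary on different operating systems or driver versions should give a feel for how much the fixed cost varies between them.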
With reference to your earlier thread, perhaps even more problematic than the fixed overheads associated with kernel launches is the cost of device memory reads. Splitting up operations which could otherwise be combined into a single kernel will require reading and writing data from global memory more times than necessary to complete the calculation. In low Flop count, memory bandwidth limited calculations, that is a recipe for poor performance. The more Flops you can perform per memory transaction, the better the overall performance will be.
On a typical high-end card, you have 600-1000 GFlop/s or more of single precision peak available, but only about 25-40 Gfloat/s of memory bandwidth to get data in and out of the processors that deliver it. That works out to somewhere between 15 and 40 Flops of compute for every float moved, so it isn’t hard to see why anything you do to improve the Flops per memory transaction ratio will yield big improvements in performance. Conversely, anything you do to reduce the ratio (like artificially breaking up calculations and spreading them over many kernels) will hurt performance.
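To make the point about splitting versus combining concrete, here is a small sketch that computes y = (a*x + b)^2 two ways. The kernel names, the helper `maxDifferenceSplitVsFused` and the input values are all invented for illustration and error checking is omitted. The split version reads and writes global memory twice per element (four float transfers plus an extra launch), while the fused version keeps the intermediate value in a register and needs only one read and one write for the same arithmetic.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Split version, step 1: tmp = a * x + b (one global read, one global write).
__global__ void scaleShift(const float* x, float* tmp, float a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = a * x[i] + b;
}

// Split version, step 2: y = tmp * tmp (one more global read and write).
__global__ void square(const float* tmp, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = tmp[i] * tmp[i];
}

// Fused version: same arithmetic, but the intermediate stays in a register,
// so each element costs one global read and one global write in total.
__global__ void scaleShiftSquare(const float* x, float* y, float a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = a * x[i] + b;
        y[i] = t * t;
    }
}

// Runs both versions on the same input and returns the largest absolute
// difference between their outputs.
float maxDifferenceSplitVsFused(int n) {
    const float a = 2.0f, b = 1.0f;  // arbitrary illustrative coefficients
    std::vector<float> hostX(n), hostSplit(n), hostFused(n);
    for (int i = 0; i < n; ++i) hostX[i] = static_cast<float>(i) / n;

    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *x, *tmp, *ySplit, *yFused;
    cudaMalloc(&x, bytes);
    cudaMalloc(&tmp, bytes);
    cudaMalloc(&ySplit, bytes);
    cudaMalloc(&yFused, bytes);
    cudaMemcpy(x, hostX.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Two launches, with the intermediate round-tripping through global memory.
    scaleShift<<<blocks, threads>>>(x, tmp, a, b, n);
    square<<<blocks, threads>>>(tmp, ySplit, n);

    // One launch, no intermediate array needed.
    scaleShiftSquare<<<blocks, threads>>>(x, yFused, a, b, n);

    // cudaMemcpy on the default stream waits for the kernels above to finish.
    cudaMemcpy(hostSplit.data(), ySplit, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(hostFused.data(), yFused, bytes, cudaMemcpyDeviceToHost);

    float maxDiff = 0.0f;
    for (int i = 0; i < n; ++i)
        maxDiff = std::fmax(maxDiff, std::fabs(hostSplit[i] - hostFused[i]));

    cudaFree(x);
    cudaFree(tmp);
    cudaFree(ySplit);
    cudaFree(yFused);
    return maxDiff;
}

int main() {
    printf("max |split - fused| = %g\n", maxDifferenceSplitVsFused(1 << 20));
    return 0;
}
```

Wrapping each variant in CUDA events would show the timing gap on your own hardware. For a kernel this light on arithmetic, I would expect the memory traffic rather than the Flops to dominate, but that is something to measure rather than assume.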
An example of an implementation that uses extremely short kernels and suffers from kernel launch overhead is bbsort (http://forums.nvidia.com/index.php?showtopic=98813). In my experiments with bbsort, the performance difference between Vista and Linux was about 30%.