Overhead of invocation of different kernels: are multiple kernels cached somehow?

If I keep calling different kernels in an alternating sequence, are the various kernel invocations cached (given that launch parameters such as block size stay the same)? If they are cached, how many different kernels can I call in sequence before this cache spills?

The underlying question is whether splitting a complex work task into multiple smaller kernels hurts performance because each switch to a different kernel incurs the full (first-time) invocation overhead.
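
One way to check this directly is a small timing test. The sketch below is only a rough illustration, not a definitive benchmark: `kernelA`, `kernelB`, `timeLaunches`, the launch configuration of 1 block of 64 threads, and the iteration count are all hypothetical choices made for the example. It compares the average cost per launch when the same kernel is launched repeatedly against the cost when two kernels alternate, after a warm-up launch of each so that first-time costs are excluded.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Two hypothetical, trivially small kernels so that launch overhead
// dominates the measured time rather than the work itself.
__global__ void kernelA(float *data) { data[threadIdx.x] = 1.0f; }
__global__ void kernelB(float *data) { data[threadIdx.x] = 2.0f; }

// Returns the average time in milliseconds per launch over `iters` launches.
// If `alternate` is true, odd iterations launch kernelB instead of kernelA.
static float timeLaunches(float *d_data, int iters, bool alternate) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) {
        if (alternate && (i & 1))
            kernelB<<<1, 64>>>(d_data);
        else
            kernelA<<<1, 64>>>(d_data);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all queued launches finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / iters;
}

int main() {
    const int iters = 10000;  // assumed iteration count, adjust as needed
    float *d_data = nullptr;
    cudaMalloc(&d_data, 64 * sizeof(float));
    cudaMemset(d_data, 0, 64 * sizeof(float));

    // Warm-up: launch each kernel once so first-time costs are not timed.
    kernelA<<<1, 64>>>(d_data);
    kernelB<<<1, 64>>>(d_data);
    cudaDeviceSynchronize();

    float same = timeLaunches(d_data, iters, false);
    float alt  = timeLaunches(d_data, iters, true);

    printf("same kernel:         %f ms per launch\n", same);
    printf("alternating kernels: %f ms per launch\n", alt);

    cudaFree(d_data);
    return 0;
}
```

If the two averages come out close, alternating between kernels is not paying the first-time cost on every switch, at least for two kernels on that particular device and driver. Extending the test to more distinct kernels would be the way to probe whether there is a limit.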


See yesterday’s post on this topic: http://forums.nvidia.com/index.php?showtopic=73698&hl=