When I first read that Fermi would be able to schedule up to 16 kernels for concurrent execution, I immediately wondered whether this would finally allow us to invoke a grid INSIDE kernel code. I very much hope this is the case; combined with the new support for memory allocation inside kernel code, it would be extremely powerful.
The advantages are obvious with concurrent kernel execution: it would let programmers efficiently contract or expand the thread count of in-flight work by launching a follow-up grid, WITHOUT breaking optimal work distribution across the SMs or wasting a lot of instructions on loops.
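To make the idea concrete, here is a minimal sketch of what expanding a kernel in flight might look like. The device-side triple-chevron launch, the `do_work` and `needs_refinement` helpers, and the 4x threshold are all pure invention on my part; nothing like this exists in Fermi's documented API:

```cuda
// HYPOTHETICAL: device-side grid launch is NOT part of Fermi's API.
// This sketches how a kernel might expand its own parallelism in flight.
__global__ void refine(float *data, int n);

__global__ void process(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    data[i] = do_work(data[i]);   // do_work: placeholder for the real math

    // Suppose thread 0 discovers the problem now needs 4x more threads:
    // instead of looping 4x inside every thread (wasting loop instructions
    // and unbalancing the SMs), it would simply launch a wider child grid.
    if (i == 0 && needs_refinement(data, n))          // hypothetical test
        refine<<<4 * gridDim.x, blockDim.x>>>(data, 4 * n);
}
```

The point of the sketch: the new grid gets its own, correctly sized thread count, so the hardware scheduler redistributes the work across the SMs instead of each original thread grinding through a loop.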
It doesn't take much imagination to come up with general-purpose situations where this would be extremely useful, and I can think of many places in a DirectX 11 style graphics pipeline where it would help as well. You could go from massively parallel surface-level T&L and culling, into massively parallel vertex T&L and culling, into massively parallel pixel rasterization, without EVER going back to the CPU after the first and original call; ALL the SMs would stay active the whole time, and you could avoid almost any loop instructions.
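Sticking with the same hypothetical device-side launch (again, invented syntax that Fermi does not expose, and the stage/helper names are made up), the pipeline idea would chain stages GPU-side, each stage sizing the next from its own culling results:

```cuda
// HYPOTHETICAL: invented device-side launch; Fermi has no such API.
// Each stage culls, then sizes and launches the next stage itself,
// so the CPU is only involved in the very first call.
__global__ void rasterize(Fragment *f, int nFrag);

__global__ void vertex_stage(Vertex *v, int nVert, Fragment *f)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nVert) transform_and_cull(v[i], f);       // placeholder

    if (i == 0) {
        int nFrag = count_surviving_fragments(f);     // placeholder
        rasterize<<<(nFrag + 255) / 256, 256>>>(f, nFrag);
    }
}

__global__ void surface_stage(Patch *p, int nPatch, Vertex *v, Fragment *f)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nPatch) tessellate_and_cull(p[i], v);     // placeholder

    if (i == 0) {
        int nVert = count_emitted_vertices(v);        // placeholder
        vertex_stage<<<(nVert + 255) / 256, 256>>>(v, nVert, f);
    }
}
// Host launches surface_stage once; everything after stays on the GPU.
```

Each grid is sized exactly to the surviving work of the previous stage, which is why the SMs never sit idle and no thread needs a work-stealing loop.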