When I first read that Fermi would be able to schedule up to 16 kernels for execution concurrently, I immediately wondered whether this would finally allow us to invoke a grid INSIDE kernel code. I very much hope this is the case; combined with the new support for memory allocation in kernel code, it would be extremely powerful.
The advantages are obvious once you add concurrent kernel execution: programmers could efficiently contract or expand a kernel to a new number of threads while in flight, WITHOUT breaking optimal work distribution across the SMs or wasting a lot of instructions on loops.
You don’t need much imagination to come up with general-purpose situations where this would be extremely useful, and I can think of many situations in a DirectX11-style graphics pipeline where it would help as well. You could go from massively parallel surface-level T&L and culling, into massively parallel vertex T&L and culling, into massively parallel pixel rasterization without EVER going back to the CPU after the first and original call; ALL the SMs would stay active the whole time, and you could avoid almost any loop instructions.
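To make the idea concrete, here is a minimal sketch of what a device-side launch might look like, assuming NVIDIA ever exposed the usual <<<>>> syntax inside kernel code. Nothing like this is confirmed for Fermi; the kernels, sizes, and parameters below are made up purely for illustration.

```
// HYPOTHETICAL sketch: what launching a grid from inside a kernel might look
// like IF the <<<>>> syntax were allowed in device code. Not confirmed for
// Fermi; processChunk/expandWork and their parameters are invented.

__global__ void processChunk(const float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // ... per-element work (offsets into data elided for brevity) ...
    }
}

__global__ void expandWork(const float *data, const int *chunkSizes, int numChunks)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c < numChunks) {
        int n = chunkSizes[c];   // amount of work discovered at runtime

        // The speculative part: spawn a new grid sized to the work we just
        // found, instead of looping or bouncing back to the CPU.
        processChunk<<<(n + 255) / 256, 256>>>(data, n);
    }
}
```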
I believe the CPU is still heavily involved at the driver level to provide the GPU with work. The CPU is probably even involved in distributing the thread blocks to the individual multiprocessors (the hardware scheduler operates at the multiprocessor level). So launching a kernel from within a kernel sounds a bit unlikely, considering what I just said.
The Fermi whitepaper states that the scheduling system now operates in hardware at both levels, both at the SM level and at the chip level, to assign kernels and thread blocks to the SMs.
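As a side note on how that concurrency might look from the programmer's side, here is a rough sketch of feeding the chip-level scheduler several independent kernels through separate streams, on the assumption that this is how the 16-kernel concurrency gets exposed; kernelA/B/C, their grid sizes, and their arguments are placeholders.

```
// Sketch: submit independent kernels in separate streams so the chip-level
// scheduler is free to overlap them on idle SMs. kernelA/B/C, the grid
// sizes, and the arguments are placeholders.
cudaStream_t s[3];
for (int i = 0; i < 3; ++i)
    cudaStreamCreate(&s[i]);

kernelA<<<gridA, 256, 0, s[0]>>>(dataA);
kernelB<<<gridB, 256, 0, s[1]>>>(dataB);
kernelC<<<gridC, 256, 0, s[2]>>>(dataC);

cudaDeviceSynchronize();            // wait for all three to finish
for (int i = 0; i < 3; ++i)
    cudaStreamDestroy(s[i]);
```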
Fermi is expected to be hardware that also supports the DX11 API, and that API has a command named DrawIndirect, which basically schedules a rendering call but takes the call's parameters from a buffer. It could be that Fermi will have something like this for compute too, where you schedule a call but provide the thread configuration and parameters in a buffer generated by another kernel.
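Until something like that exists for compute, the closest approximation is a round trip: one kernel writes the launch configuration into a small buffer, the host copies it back, and the host issues the dependent launch with those values. A rough sketch; LaunchConfig and the kernel names are made up for illustration.

```
// Approximating a DrawIndirect-style launch with the current API: a kernel
// decides the size of the next launch and the host reads it back.
// LaunchConfig, decideNextLaunch and nextKernel are made-up names.
struct LaunchConfig { unsigned int grid; unsigned int block; };

__global__ void decideNextLaunch(LaunchConfig *cfg, const int *workCount)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        cfg->block = 256;
        cfg->grid  = (*workCount + 255) / 256;   // size the next grid to the work found
    }
}

__global__ void nextKernel(int n) { /* placeholder for the dependent pass */ }

void launchIndirectly(LaunchConfig *d_cfg, const int *d_workCount, int n)
{
    decideNextLaunch<<<1, 1>>>(d_cfg, d_workCount);

    // This CPU round trip is exactly what a "DrawIndirect for compute" would remove.
    LaunchConfig h_cfg;
    cudaMemcpy(&h_cfg, d_cfg, sizeof(h_cfg), cudaMemcpyDeviceToHost);

    nextKernel<<<h_cfg.grid, h_cfg.block>>>(n);
}
```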
You seemed to have missed the “Note to self” and took it as an offense.
Just to make sure that you don’t misunderstand this: I was the one who made the unqualified comment, ok?
This was me pointing out to myself that maybe next time I should read the whitepaper before commenting on things I haven’t read up on yet.
Yes, I will read the whitepaper, but not now. Thanks for the link.
Maybe this scheduling resulted from the “lack of dynamic block scheduling” feedback given in the forums… I don’t think it would be related to kernels launching kernels… but that’s just my guess…
Graphics and game work scheduling tends to involve a wide and varying parallel workload, one that changes dynamically and has significant interdependencies, and on top of it all it’s realtime, so efficient task scheduling is critical.
Support for multiple kernels (including kernels launching kernels) is pretty much required. The question is how that support will work, especially whether there will be higher-level locks or events for coordinating launches, like “after the cloth sim and the water spray are BOTH computed, launch the first stage of graphics setup based on that geometry.”
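That particular dependency can already be written down with the stream/event calls that appeared around the Fermi timeframe (cudaEventRecord plus cudaStreamWaitEvent); whether Fermi exposes anything richer is the open question. The kernel names, grid sizes, and arguments below are placeholders for the example above.

```
// Sketch: "after the cloth sim and the water spray are BOTH computed,
// launch the first stage of graphics setup", expressed with streams and
// events. Kernel names, grid sizes and arguments are placeholders.
cudaStream_t cloth, spray, gfx;
cudaEvent_t  clothDone, sprayDone;

cudaStreamCreate(&cloth);
cudaStreamCreate(&spray);
cudaStreamCreate(&gfx);
cudaEventCreateWithFlags(&clothDone, cudaEventDisableTiming);
cudaEventCreateWithFlags(&sprayDone, cudaEventDisableTiming);

clothSimKernel<<<clothGrid, 256, 0, cloth>>>(clothState);
cudaEventRecord(clothDone, cloth);

waterSprayKernel<<<sprayGrid, 256, 0, spray>>>(sprayState);
cudaEventRecord(sprayDone, spray);

// The gfx stream stalls on the GPU (no CPU polling) until BOTH events fire.
cudaStreamWaitEvent(gfx, clothDone, 0);
cudaStreamWaitEvent(gfx, sprayDone, 0);
graphicsSetupKernel<<<gfxGrid, 256, 0, gfx>>>(sceneGeometry);
```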
An excellent example is page 7 of the DICE presentation from the ’09 SIGGRAPH course Beyond Programmable Shading. Look at the work queue for the multiple computations needed for a single frame of game graphics and all the task interdependencies… lots of parallel opportunities, but only if you can coordinate those dependencies easily (and without hacks like polling). Using the CPU to coordinate it all would also be challenging.
It’s not quite clear from the Fermi whitepaper how the scheduling options will be exposed in CUDA… we’ll see soon enough.
Even with launch latency decreased, it would seem stupid to add instructions for something like memory allocation inside kernel code, which follows basically the same mentality, without also adding instructions for launching new kernels from a kernel. Think about how much less geometry shaders (especially those that emit large ratios of output to input geometry) would suck in implementation if the driver could use this as a hardware feature to keep the entire pipeline running without going back to the CPU.
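For the allocation half of that, here is a minimal sketch of what device-side malloc could buy in a geometry-amplification setting, assuming the in-kernel allocator works roughly like the host one; the amplification factor and the kernel itself are invented for illustration.

```
// Sketch: in-kernel allocation for variable-ratio output, in the spirit of a
// geometry shader that emits an unpredictable amount of geometry per input
// primitive. The "tessellation factor" below is a placeholder.
__global__ void amplifyGeometry(const float4 *inVerts, float4 **outVerts,
                                int *outCounts, int numPrims)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= numPrims) return;

    int emitted = 3 + (p % 16);                  // placeholder per-primitive output count

    // Device-side allocation: no CPU round trip to pre-size an output buffer.
    float4 *buf = (float4 *)malloc(emitted * sizeof(float4));
    if (buf == NULL) { outCounts[p] = 0; return; }

    for (int v = 0; v < emitted; ++v)
        buf[v] = inVerts[p];                     // placeholder for the real expansion math

    outVerts[p]  = buf;                          // a later kernel would consume and free these
    outCounts[p] = emitted;
}
```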