Fermi speculation: Kernel invocation in kernel code

parlance · October 18, 2009, 2:41pm

When I first read that Fermi would be able to schedule up to 16 kernels for concurrent execution, I immediately wondered whether this would finally allow us to invoke a grid INSIDE kernel code. I very much hope this is the case. Together with the new support for memory allocation in kernel code, this would be extremely powerful.

The advantages of this are obvious with concurrent kernel execution, as this would allow programmers to efficiently contract or expand kernels to a new number of threads while in flight, WITHOUT breaking optimum work distribution across the SMs or wasting a lot of instructions on loops.

It doesn’t take much imagination to think of general-purpose situations where this would be extremely useful, and I can think of many in a DirectX 11 style graphics pipeline as well. You could go from massively parallel surface-level T&L and culling, into massively parallel vertex T&L and culling, into massively parallel pixel rasterization without EVER going back to the CPU after the original call. ALL the SMs would always stay active, and you could avoid almost all loop instructions.
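A minimal sketch of what "invoking a grid inside kernel code" could look like. It assumes a device and toolkit that support device-side launches (CUDA Dynamic Parallelism: compute capability 3.5 or newer, built with relocatable device code via `-rdc=true` and linked against `cudadevrt`); nothing in this thread confirms that Fermi offers this. The kernel and function names are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Child grid: one thread per element, doubles each value.
__global__ void childScale(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

// Parent grid: one thread reads the amount of work from device memory and
// launches a child grid sized for it, without returning to the CPU.
__global__ void parentExpand(int *data, const int *count) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        int n = *count;
        if (n > 0) {
            int threads = 128;
            int blocks = (n + threads - 1) / threads;
            childScale<<<blocks, threads>>>(data, n);  // device-side launch
        }
    }
}

// Hypothetical host driver: fills data with 0..n-1, runs the nested launch,
// and returns the sum of the results, or -1 if any CUDA call failed.
long long runNestedLaunch(int n) {
    std::vector<int> host(n);
    for (int i = 0; i < n; ++i) host[i] = i;

    int *dData = nullptr, *dCount = nullptr;
    if (cudaMalloc(&dData, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&dCount, sizeof(int)) != cudaSuccess) {
        cudaFree(dData);
        return -1;
    }
    cudaMemcpy(dData, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dCount, &n, sizeof(int), cudaMemcpyHostToDevice);

    parentExpand<<<1, 32>>>(dData, dCount);     // the only launch issued by the CPU
    cudaError_t err = cudaDeviceSynchronize();  // the parent finishes only after its child

    long long sum = -1;
    if (err == cudaSuccess &&
        cudaMemcpy(host.data(), dData, n * sizeof(int),
                   cudaMemcpyDeviceToHost) == cudaSuccess) {
        sum = 0;
        for (int v : host) sum += v;
    }
    cudaFree(dData);
    cudaFree(dCount);
    return sum;
}
```

The point of the sketch is that the child grid's size is decided from data already on the GPU, which is the "contract or expand while in flight" pattern described above.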

cbuchner1 · October 19, 2009, 11:58am

Hmm, I don’t think this is possible, here’s why.

I believe the CPU is still heavily involved at the driver level to provide the GPU with work. The CPU is probably even involved in distributing the thread blocks to the individual multiprocessors (the hardware scheduler operates at the multiprocessor level). So launching a kernel from within a kernel sounds a bit unlikely, considering what I just said.

parlance · October 19, 2009, 2:48pm

The whitepaper for Fermi states that the scheduling system actually operates at both levels in hardware now; both at the SM level and at the chip level to assign kernels and threadblocks to SMs.

cbuchner1 · October 19, 2009, 3:24pm

Note to self: Read whitepapers before making unqualified comments.

Thanks ;)

sergeyn · October 19, 2009, 3:58pm

Fermi hardware is also expected to support the DX11 API, which has a command named DrawIndirect. It basically schedules a rendering call, but takes the input parameters for the call from a buffer. It could be that Fermi will also have something like this for compute, where you schedule a call but provide the thread configuration and parameters in a buffer generated by another kernel.
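The buffer-driven idea can be approximated today with a host round trip. The sketch below is not the DrawIndirect-style mechanism itself: one kernel writes the size of the next launch into a device buffer, and the CPU reads that buffer back to configure the second launch. A hardware-indirect version would skip the read-back. The `LaunchArgs` layout and all function names are assumptions made for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical layout for launch parameters produced on the GPU.
struct LaunchArgs {
    unsigned int count;  // number of work items for the next kernel
};

// First kernel: counts the items that survive a cull and records the
// count in the parameter buffer.
__global__ void cullAndCount(const int *visible, int n, LaunchArgs *args) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && visible[i]) atomicAdd(&args->count, 1u);
}

// Second kernel: launched with exactly as many threads as the buffer asks for.
__global__ void processSurvivors(int *out, unsigned int count) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) out[i] = 1;
}

// Hypothetical host driver: marks every third item visible and returns how
// many items the second kernel processed, or -1 if a CUDA call failed.
int runIndirectSketch(int n) {
    std::vector<int> visible(n);
    for (int i = 0; i < n; ++i) visible[i] = (i % 3 == 0);

    int *dVisible = nullptr, *dOut = nullptr;
    LaunchArgs *dArgs = nullptr;
    cudaMalloc(&dVisible, n * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(int));
    cudaMalloc(&dArgs, sizeof(LaunchArgs));
    cudaMemcpy(dVisible, visible.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, n * sizeof(int));
    cudaMemset(dArgs, 0, sizeof(LaunchArgs));

    const int threads = 256;
    cullAndCount<<<(n + threads - 1) / threads, threads>>>(dVisible, n, dArgs);

    // Host round trip: this blocking copy waits for the first kernel.
    LaunchArgs args = {0};
    cudaMemcpy(&args, dArgs, sizeof(args), cudaMemcpyDeviceToHost);

    if (args.count > 0) {
        unsigned int blocks = (args.count + threads - 1) / threads;
        processSurvivors<<<blocks, threads>>>(dOut, args.count);
    }

    std::vector<int> out(n);
    cudaError_t err = cudaMemcpy(out.data(), dOut, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    int processed = -1;
    if (err == cudaSuccess) {
        processed = 0;
        for (int v : out) processed += v;
    }
    cudaFree(dVisible);
    cudaFree(dOut);
    cudaFree(dArgs);
    return processed;
}
```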

parlance · October 19, 2009, 9:22pm

Are you illiterate perhaps?

GigaThread Scheduler

“One of the most important technologies of the Fermi architecture is its two-level, distributed thread scheduler. At the chip level, a global work distribution engine schedules thread blocks to various SMs, while at the SM level, each warp scheduler distributes warps of 32 threads to its execution units.”

That’s on page 18 of the whitepaper at http://www.nvidia.com/content/PDF/fermi_wh…eWhitepaper.pdf

cbuchner1 · October 19, 2009, 9:46pm

You seem to have missed the “Note to self” and taken it as an offense.

Just to make sure that you don’t misunderstand this: I was the one who made the unqualified comment, ok?

This was me pointing out to myself that maybe next time I should read a whitepaper on things before commenting on things that I haven’t read the whitepaper about yet.

Yes, I will read the whitepaper, but not now. Thanks for the link.

parlance · October 19, 2009, 10:28pm

Sorry, the CUDA forum winky emoticon just looks so sarcastic :(

Sorry.

Sarnath · October 20, 2009, 4:48am

Maybe this scheduling resulted from the “lack of dynamic block scheduling” feedback given in the forums… I don’t think it is related to kernels launching kernels… But that’s just my guess…

SPWorley · October 20, 2009, 5:01am

Graphics and game work scheduling tends to have a wide and varying parallel workload, which changes dynamically and has significant interdependencies. On top of it all, it’s realtime, so efficient task scheduling is critical.

Multiple kernels (with kernels launching kernels) are pretty much required. The question is how that support will work, especially whether there are higher-level locks or events for coordinating launches, like “after the cloth sim and the water spray are BOTH computed, launch the first stage of graphics setup based on that geometry.”

An excellent example is page 7 of the DICE presentation at the SIGGRAPH 2009 course “Beyond Programmable Shading”. Look at the work queue for the multiple computations needed for a single frame of game graphics and all those task interdependencies… lots of parallel opportunities, but only if you can coordinate those dependencies easily (and without hacks like polling). Using the CPU to coordinate it all would also be challenging.

It’s not quite clear from the Fermi whitepaper how the scheduling options will be exposed to CUDA… we’ll see soon enough.
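One way the “after the cloth sim and the water spray are BOTH computed” dependency can be written from the host side is with the CUDA runtime's streams and events. This is only a sketch: the three kernels are hypothetical stand-ins, and the CPU still issues every launch. It shows the dependency being enforced on the GPU side rather than by polling.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-ins for the two independent simulations.
__global__ void clothSim(float *cloth, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) cloth[i] = 1.0f;
}
__global__ void waterSpray(float *water, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) water[i] = 2.0f;
}
// Hypothetical setup stage that needs the results of both simulations.
__global__ void graphicsSetup(const float *cloth, const float *water,
                              float *geom, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) geom[i] = cloth[i] + water[i];
}

// Returns how many geometry elements saw both inputs, or -1 on a CUDA error.
int runFrameSketch(int n) {
    float *cloth = nullptr, *water = nullptr, *geom = nullptr;
    cudaMalloc(&cloth, n * sizeof(float));
    cudaMalloc(&water, n * sizeof(float));
    cudaMalloc(&geom, n * sizeof(float));

    cudaStream_t sCloth, sWater, sSetup;
    cudaStreamCreate(&sCloth);
    cudaStreamCreate(&sWater);
    cudaStreamCreate(&sSetup);
    cudaEvent_t clothDone, waterDone;
    cudaEventCreateWithFlags(&clothDone, cudaEventDisableTiming);
    cudaEventCreateWithFlags(&waterDone, cudaEventDisableTiming);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // The two simulations go into separate streams and may overlap.
    clothSim<<<blocks, threads, 0, sCloth>>>(cloth, n);
    cudaEventRecord(clothDone, sCloth);
    waterSpray<<<blocks, threads, 0, sWater>>>(water, n);
    cudaEventRecord(waterDone, sWater);

    // The setup stream waits for BOTH events before running its kernel.
    cudaStreamWaitEvent(sSetup, clothDone, 0);
    cudaStreamWaitEvent(sSetup, waterDone, 0);
    graphicsSetup<<<blocks, threads, 0, sSetup>>>(cloth, water, geom, n);

    int ok = -1;
    std::vector<float> host(n);
    if (cudaStreamSynchronize(sSetup) == cudaSuccess &&
        cudaMemcpy(host.data(), geom, n * sizeof(float),
                   cudaMemcpyDeviceToHost) == cudaSuccess) {
        ok = 0;
        for (float v : host) ok += (v == 3.0f);
    }

    cudaEventDestroy(clothDone);
    cudaEventDestroy(waterDone);
    cudaStreamDestroy(sCloth);
    cudaStreamDestroy(sWater);
    cudaStreamDestroy(sSetup);
    cudaFree(cloth);
    cudaFree(water);
    cudaFree(geom);
    return ok;
}
```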

parlance · October 20, 2009, 8:15am

Even with launch latency decreased, it would seem stupid to integrate instructions for something like memory allocation inside kernel code without also integrating instructions for launching new kernels from a kernel, since both come from basically the same kind of mentality. Think about how much less geometry shaders (especially those that emit large ratios of output to input geometry) would suck in implementation if the driver used this as a hardware feature to keep the entire pipeline running without going back to the CPU.
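For reference, memory allocation inside kernel code can be sketched as below, assuming a device of compute capability 2.0 or newer, where `malloc` and `free` are callable from device code. The heap these allocations come from has a fixed size, which the host can adjust with `cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...)` before the first launch that uses it. The kernel and function names are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread allocates its own scratch buffer from the device heap,
// uses it, and frees it again, all inside the kernel.
__global__ void scratchPerThread(int *out, int perThread) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int *scratch = static_cast<int *>(malloc(perThread * sizeof(int)));
    if (scratch == nullptr) {  // device malloc returns NULL if the heap is exhausted
        out[tid] = -1;
        return;
    }
    for (int k = 0; k < perThread; ++k) scratch[k] = k;
    int sum = 0;
    for (int k = 0; k < perThread; ++k) sum += scratch[k];
    free(scratch);
    out[tid] = sum;
}

// Hypothetical host driver: runs 256 threads and returns how many of them
// produced the expected sum 0 + 1 + ... + (perThread - 1), or -1 on error.
int runDeviceMallocSketch(int perThread) {
    const int blocks = 2, threads = 128, total = blocks * threads;
    int *dOut = nullptr;
    if (cudaMalloc(&dOut, total * sizeof(int)) != cudaSuccess) return -1;

    scratchPerThread<<<blocks, threads>>>(dOut, perThread);

    std::vector<int> out(total);
    cudaError_t err = cudaMemcpy(out.data(), dOut, total * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    if (err != cudaSuccess) return -1;

    const int expected = perThread * (perThread - 1) / 2;
    int matches = 0;
    for (int v : out) matches += (v == expected);
    return matches;
}
```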
