Streaming Concurrent Kernels (in Fermi GPUs) ...

samarawickrama · May 5, 2013, 12:36pm

Hi,

For Fermi, it is stated that up to 16 kernels can run concurrently in separate streams.

1. What happens if we launch more than 16 (say 100) kernels in separate streams and there are not enough resources? Will the GPU still run all the kernels, just without full concurrency, or will it discard (not run) the excess streams?
2. What is the basic entity that executes a streaming kernel? (Is it a streaming multiprocessor?)
3. When a streaming kernel is executed, does it fully utilize a streaming multiprocessor (SM)?
4. If the number of streams running concurrently is less than the number of streaming multiprocessors (SMs), will the concurrent streams be scheduled on different SMs?

Thank you.

Uncle_Joe · May 6, 2013, 8:13pm

1. If you launch more than 16 concurrent kernels, the GPU keeps scheduling kernels until 16 are running concurrently or until there are no longer enough resources (e.g. registers, shared memory) to schedule another one. Then, as soon as kernels finish, more kernels are launched, so the excess kernels are delayed rather than discarded.
2. This seems to be a metaphysical question and I’m not a philosopher. To me, the whole GPU is needed.
3. This depends entirely on the amount of resources your kernel needs (e.g. registers, shared memory, threads).
4. It’s not clear how thread blocks are assigned to SMs, but I know that at least the idle SMs are filled as soon as possible. If you really want to find out the scheduling behavior, you can log the %smid register in each of your thread blocks to see which SM is being used.
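
A minimal sketch of the %smid logging idea, in case it helps. The names `get_smid`, `record_smid` and `probe_block_placement` are made up for illustration; only the `%smid` PTX special register and the CUDA runtime calls are real. One thread per block writes the ID of the SM it ran on, and the host copies the result back.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Read the %smid PTX special register: the ID of the SM executing this thread.
__device__ unsigned int get_smid() {
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// Hypothetical probe kernel: thread 0 of each block records the block's SM ID.
__global__ void record_smid(unsigned int *sm_of_block) {
    if (threadIdx.x == 0) {
        sm_of_block[blockIdx.x] = get_smid();
    }
}

// Launches num_blocks blocks and returns the SM ID each block reported.
// Returns an empty vector if any CUDA call fails.
std::vector<unsigned int> probe_block_placement(int num_blocks, int threads_per_block) {
    assert(num_blocks > 0 && threads_per_block > 0);
    std::vector<unsigned int> host(num_blocks, 0u);
    size_t bytes = num_blocks * sizeof(unsigned int);

    unsigned int *dev = 0;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) {
        return std::vector<unsigned int>();
    }

    record_smid<<<num_blocks, threads_per_block>>>(dev);
    cudaError_t launch_err = cudaGetLastError();

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t copy_err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);

    if (launch_err != cudaSuccess || copy_err != cudaSuccess) {
        return std::vector<unsigned int>();
    }
    return host;
}
```

Calling `probe_block_placement(8, 32)` and printing the returned IDs should show how the blocks were spread over the SMs for that launch. The same `get_smid()` call could be dropped into a real kernel running in several streams; the values only say where a block ran, not why the scheduler placed it there.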

To save you some potential trouble that I’ve had, there’s one very unintuitive behavior for overlapping kernels on Fermi. Basically, there’s a hardware limitation (a single issue queue) such that a kernel cannot start until all previously issued kernels have started (see the Kepler white paper’s explanation of Hyper-Q). This can artificially prevent streams from executing concurrently. Therefore, a best practice on Fermi is to issue kernels in breadth-first order (issue all parallel kernels before issuing dependent kernels).
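
Here is a rough sketch of what breadth-first issue order could look like. The kernels `stage_a` and `stage_b` and the helper `run_breadth_first` are hypothetical placeholders: `stage_b` depends on `stage_a` within the same stream, while the streams are independent of each other. The first loop issues `stage_a` in every stream, and only then does the second loop issue the dependent `stage_b` kernels.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical first stage: adds 1 to every element.
__global__ void stage_a(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical second stage: doubles every element (depends on stage_a).
__global__ void stage_b(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Runs both stages in num_streams streams, issuing kernels breadth-first.
// Returns true if all CUDA calls succeed and every element equals (0 + 1) * 2.
bool run_breadth_first(int num_streams, int n) {
    assert(num_streams > 0 && n > 0);
    std::vector<cudaStream_t> streams(num_streams);
    std::vector<float *> bufs(num_streams, (float *)0);
    size_t bytes = n * sizeof(float);
    bool ok = true;

    for (int s = 0; s < num_streams; ++s) {
        ok = ok && cudaStreamCreate(&streams[s]) == cudaSuccess;
        ok = ok && cudaMalloc(&bufs[s], bytes) == cudaSuccess;
        ok = ok && cudaMemset(bufs[s], 0, bytes) == cudaSuccess;
    }

    if (ok) {
        int threads = 128;
        int blocks = (n + threads - 1) / threads;

        // Breadth-first: issue the independent kernels of every stream first...
        for (int s = 0; s < num_streams; ++s)
            stage_a<<<blocks, threads, 0, streams[s]>>>(bufs[s], n);
        // ...then the kernels that depend on them.
        for (int s = 0; s < num_streams; ++s)
            stage_b<<<blocks, threads, 0, streams[s]>>>(bufs[s], n);

        ok = cudaGetLastError() == cudaSuccess;
        ok = ok && cudaDeviceSynchronize() == cudaSuccess;
    }

    // Copy each buffer back and check the result.
    std::vector<float> host(n);
    for (int s = 0; ok && s < num_streams; ++s) {
        ok = cudaMemcpy(host.data(), bufs[s], bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int i = 0; ok && i < n; ++i) ok = (host[i] == 2.0f);
    }

    // Release whatever was successfully created.
    for (int s = 0; s < num_streams; ++s) {
        if (bufs[s]) cudaFree(bufs[s]);
        cudaStreamDestroy(streams[s]);
    }
    return ok;
}
```

The depth-first alternative would issue `stage_a` and `stage_b` for stream 0, then for stream 1, and so on. With a single issue queue, each `stage_b` then sits in front of the next stream's `stage_a`, which is the situation described above. Whether the breadth-first version actually overlaps on a given card is best confirmed with a profiler timeline.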

Hope that helps

samarawickrama · May 7, 2013, 4:24am

Thank you for your reply.
