Hi,
According to the documentation, Fermi can run up to 16 kernels concurrently when they are launched in different streams.
-
What happens if we launch more than 16 kernels (say 100) in separate streams and the resources are not enough? Will the GPU still execute all of the kernels, just without full concurrency, or will it discard (not execute) the excess ones?
-
What is the basic entity that executes a streaming kernel? (Is it a streaming multiprocessor?)
-
When a streaming kernel is executed, does it fully utilize a streaming multiprocessor (SM)?
-
If the number of streams running concurrently is less than the number of streaming multiprocessors (SMs), will concurrent stream instructions be scheduled on different SMs?
Thank you.
-
If you launch more than 16 kernels, the GPU keeps scheduling them until 16 are running concurrently or until there are no longer enough resources (e.g. registers, shared memory) to schedule another one. As soon as running kernels finish, the remaining ones are launched, so every kernel is eventually executed and none are discarded.
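If you want to convince yourself of this, here is a minimal sketch that launches 100 tiny kernels in 100 streams and then counts how many of them actually ran. The kernel name `markDone`, the constant `kNumKernels` and the flag array are made up for illustration, and error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each launch sets its own flag so the host can see it ran.
__global__ void markDone(int *flags, int id)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        flags[id] = 1;
}

int main()
{
    const int kNumKernels = 100;  // deliberately more than the 16-kernel limit
    cudaStream_t streams[kNumKernels];
    int hFlags[kNumKernels];
    int *dFlags = NULL;

    cudaMalloc((void **)&dFlags, kNumKernels * sizeof(int));
    cudaMemset(dFlags, 0, kNumKernels * sizeof(int));

    // One stream per kernel, so the launches are allowed to overlap.
    for (int i = 0; i < kNumKernels; ++i) {
        cudaStreamCreate(&streams[i]);
        markDone<<<1, 32, 0, streams[i]>>>(dFlags, i);
    }

    // Wait for every stream, then count the kernels that executed.
    cudaDeviceSynchronize();
    cudaMemcpy(hFlags, dFlags, kNumKernels * sizeof(int), cudaMemcpyDeviceToHost);

    int ran = 0;
    for (int i = 0; i < kNumKernels; ++i)
        ran += hFlags[i];
    printf("%d of %d kernels ran\n", ran, kNumKernels);

    for (int i = 0; i < kNumKernels; ++i)
        cudaStreamDestroy(streams[i]);
    cudaFree(dFlags);
    return 0;
}
```

This only shows that all launches complete. It says nothing about how many of them overlapped; for that you would need a profiler or timestamps.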
-
This seems to be a metaphysical question and I’m not a philosopher. To me, the whole GPU is needed.
-
This depends entirely on the amount of resources your kernel needs (e.g. registers, shared memory, threads).
-
It’s not clear how thread blocks are assigned to SMs, but I know at least the idle SMs are filled as soon as possible. If you really want to find out the scheduling behavior, you can log the %smid PTX special register in each of your thread blocks to see which SM is being used.
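A rough sketch of the %smid logging, assuming you read the register through inline PTX. The helper `currentSmId`, the kernel `recordSm` and the block count are my own names and values, not anything from the toolkit.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Read the PTX special register %smid (ID of the SM running this thread).
__device__ unsigned int currentSmId()
{
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// Hypothetical kernel: thread 0 of each block records which SM the block landed on.
__global__ void recordSm(unsigned int *smOfBlock)
{
    if (threadIdx.x == 0)
        smOfBlock[blockIdx.x] = currentSmId();
}

int main()
{
    const int kNumBlocks = 32;  // arbitrary choice for the illustration
    unsigned int hSm[kNumBlocks];
    unsigned int *dSm = NULL;

    cudaMalloc((void **)&dSm, kNumBlocks * sizeof(unsigned int));
    recordSm<<<kNumBlocks, 64>>>(dSm);
    cudaMemcpy(hSm, dSm, kNumBlocks * sizeof(unsigned int), cudaMemcpyDeviceToHost);

    for (int b = 0; b < kNumBlocks; ++b)
        printf("block %d ran on SM %u\n", b, hSm[b]);

    cudaFree(dSm);
    return 0;
}
```

The mapping you see can change from run to run and between devices, so treat it as an observation rather than a guarantee.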
To save you some potential trouble that I’ve had, there’s one very unintuitive behavior for overlapping kernels on Fermi. Basically, there’s a hardware limitation (a single issue queue) such that a kernel cannot start until all previously issued ones have started (see the Kepler white paper’s explanation of Hyper-Q). This can artificially prevent streams from being executed concurrently. Therefore, a best practice on Fermi is to issue kernels in breadth-first order (issue all parallel kernels before issuing dependent kernels).
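To make the breadth-first order concrete, here is a small sketch with two dependent stages per stream. The kernels `stageA` and `stageB`, the stream count and the buffer size are hypothetical. The depth-first order to avoid is shown in the comment.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical first stage: independent across streams.
__global__ void stageA(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical second stage: depends on stageA in the same stream.
__global__ void stageB(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int kNumStreams = 4;
    const int n = 1024;
    cudaStream_t streams[kNumStreams];
    float *buf[kNumStreams];

    for (int s = 0; s < kNumStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc((void **)&buf[s], n * sizeof(float));
        cudaMemset(buf[s], 0, n * sizeof(float));
    }

    // Depth-first (avoid on Fermi): A0, B0, A1, B1, ...
    // B0 sits in the single queue in front of A1, so A1 waits behind it.

    // Breadth-first: issue every independent stageA first ...
    for (int s = 0; s < kNumStreams; ++s)
        stageA<<<n / 256, 256, 0, streams[s]>>>(buf[s], n);
    // ... and only then the dependent stageB launches.
    for (int s = 0; s < kNumStreams; ++s)
        stageB<<<n / 256, 256, 0, streams[s]>>>(buf[s], n);

    cudaDeviceSynchronize();

    float first = 0.0f;
    cudaMemcpy(&first, buf[0], sizeof(float), cudaMemcpyDeviceToHost);
    printf("buf[0][0] = %f\n", first);

    for (int s = 0; s < kNumStreams; ++s) {
        cudaFree(buf[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```

Both orders give the same results, since each stream still runs stageA before stageB. Only the chance of overlap between streams differs.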
Hope that helps
Thank you for your reply.