Scheduling on Fermi

auhgnist · August 7, 2010, 11:07pm

According to Fermi documents, SM_20 and higher CUDA devices should schedule threads at a warp level, rather than a thread block level as in GT-200 and previous models. Is there any further details about how this scheduler work? I think previous posts on this board about scheduling resulted in guys from NVIDIA telling me that Fermi has more ``flexible ‘’ scheduling, possibly implying better performance, maybe. But consider conventional platforms allow you to have certain affinity-control mechanisms, I wonder whether there is unexploited API features available for this. Or anybody has carried out certain micorbenchmarks on this matter. Thanks.

vvolkov · August 8, 2010, 4:07am

Oh, I didn’t catch this in the documentation. Could you point to the paper that cites this?

vvolkov · August 8, 2010, 4:07am

Oh, I didn’t catch this in the documentation. Could you point to the paper that cites this?

Sarnath · August 9, 2010, 6:05am

The original scheduler had some issues that were pointed out by developers… For instance, if some of the blocks exit in the middle of the computation, the schduler will not dynamically replace them… I hope this was fixed in FERMI and probably that is what they refer as “WARP” based scheduling…

Juss my guess though…

Sarnath · August 9, 2010, 6:05am

The original scheduler had some issues that were pointed out by developers… For instance, if some of the blocks exit in the middle of the computation, the schduler will not dynamically replace them… I hope this was fixed in FERMI and probably that is what they refer as “WARP” based scheduling…

Juss my guess though…

auhgnist · August 9, 2010, 6:29am

Sorry I think I’ve made a mistake here. I can recall that I’ve come across one preview slide about GF-100 architecture last year. It’s highly unlikely that scheduling is on Warp level, otherwise occupancy and many other problems would not exist. I can remotely recall that the slide commented the difference of scheduling difference between GT200 and GF100 and referred the GF100 schedules threads at a finer level, but I can’t find the slide now. Maybe the slide is comparing a TPC in GT200 against a SM in GF100, since an SM of GF100 now has more SP’s than a TPC of GT200. A warp in a TB can be executing on only 1 SM in an TPC on GT200, but now it can be executing on potentially more SP’s on GF100 (since more SP’s are available in one SM on GF100 than one TPC in GT200).

I still wonder how more flexible scheduling is possible on GF100, and how it is done. For example, when occupancy is 100%, with 256 threads per TB, and 6 TB’s are assigned to one SM, when one TB finishes, another TB (beyond those 6 TB’s, I mean) would start executing right away? or it has to wait for the other 5 TB’s to finish too.

auhgnist · August 9, 2010, 6:29am

Sorry I think I’ve made a mistake here. I can recall that I’ve come across one preview slide about GF-100 architecture last year. It’s highly unlikely that scheduling is on Warp level, otherwise occupancy and many other problems would not exist. I can remotely recall that the slide commented the difference of scheduling difference between GT200 and GF100 and referred the GF100 schedules threads at a finer level, but I can’t find the slide now. Maybe the slide is comparing a TPC in GT200 against a SM in GF100, since an SM of GF100 now has more SP’s than a TPC of GT200. A warp in a TB can be executing on only 1 SM in an TPC on GT200, but now it can be executing on potentially more SP’s on GF100 (since more SP’s are available in one SM on GF100 than one TPC in GT200).

I still wonder how more flexible scheduling is possible on GF100, and how it is done. For example, when occupancy is 100%, with 256 threads per TB, and 6 TB’s are assigned to one SM, when one TB finishes, another TB (beyond those 6 TB’s, I mean) would start executing right away? or it has to wait for the other 5 TB’s to finish too.

vvolkov · August 9, 2010, 8:19am

In fact, I also heard a similar story. The idea was that scheduler on GF100 can execute both pathways of a divergent branch concurrently, rather than serially.

However, I don’t know if this is true, or an academic speculation/proposal.

vvolkov · August 9, 2010, 8:19am

In fact, I also heard a similar story. The idea was that scheduler on GF100 can execute both pathways of a divergent branch concurrently, rather than serially.

However, I don’t know if this is true, or an academic speculation/proposal.

auhgnist · August 9, 2010, 8:35am

Sorry I didn’t quite get for which this is related to scheduling. For predication I can imagine that now GF-104 with CC-2.1 can dispatch more instructions and they use certain ILP exploitation techniques. Maybe fr predication they execute both paths and keep the right results. But this at least require more complex Write-Back mechanisms, since no register renaming is present. This might reduce latency especially that there is RAW (ReadAfterWrite) hazard on the predicate variable. This won’t be of much good in a massively parallel scenario such as GPU. Anyway, just some speculation on this topic…

auhgnist · August 9, 2010, 8:35am

Sorry I didn’t quite get for which this is related to scheduling. For predication I can imagine that now GF-104 with CC-2.1 can dispatch more instructions and they use certain ILP exploitation techniques. Maybe fr predication they execute both paths and keep the right results. But this at least require more complex Write-Back mechanisms, since no register renaming is present. This might reduce latency especially that there is RAW (ReadAfterWrite) hazard on the predicate variable. This won’t be of much good in a massively parallel scenario such as GPU. Anyway, just some speculation on this topic…

vvolkov · August 9, 2010, 8:50am

May be we are talking about different scheduling. I am referring to the 2 warp schedulers per SM that chose the next warp to issue an instruction from. I think you are talking about scheduling on top of that - choosing the next thread block or warp to run when other thread blocks retire.

vvolkov · August 9, 2010, 8:50am

May be we are talking about different scheduling. I am referring to the 2 warp schedulers per SM that chose the next warp to issue an instruction from. I think you are talking about scheduling on top of that - choosing the next thread block or warp to run when other thread blocks retire.

MisterAnderson42 · August 9, 2010, 12:16pm

Regarding the block-level scheduling, Tim Murray posted a few insights into how the system handles concurrent kernel launches: [url=“http://forums.nvidia.com/index.php?s=&showtopic=164912&view=findpost&p=1031594”]http://forums.nvidia.com/index.php?s=&...t&p=1031594[/url]

If there is even no partitioning of SMs for different kernels, then it seems clear that individual blocks certainly are dynamically replaced immediately when they finish executing - possibly even from a different kernel!

MisterAnderson42 · August 9, 2010, 12:16pm

Regarding the block-level scheduling, Tim Murray posted a few insights into how the system handles concurrent kernel launches: [url=“http://forums.nvidia.com/index.php?s=&showtopic=164912&view=findpost&p=1031594”]http://forums.nvidia.com/index.php?s=&...t&p=1031594[/url]

If there is even no partitioning of SMs for different kernels, then it seems clear that individual blocks certainly are dynamically replaced immediately when they finish executing - possibly even from a different kernel!

kappa · August 9, 2010, 2:59pm

I have not seen any discussion on these forums (maybe I missed it) about how the ‘shape’ of parallel algorithms affect occupancy, scheduling, and concurrent kernels. By ‘shape’ I mean, at the low ‘block’ level, using a balanced binary tree that we then sweep in two directions (up sweep and down sweep) and, after the higher ‘grid’ level, combining results from the different grid block elements to achieve the total result. It is not always the case, but it often is, that during the sweeps or merges, that the number of operations is increasing and decreasing as the number of elements (nodes of the tree or elements being merged) decreases or increases due to the inherent data structure hierarchies (i.e. balanced binary tree root versus leaves). My visualization of this is of parallel algorithm operations being more of a ‘triangle’ in shape than ‘square’–the number of (unmasked) operations is changing and not constant throughout the algorithm (step). (A squashed triangle is still a triangle–not a square or rectangle.)

Doesn’t this usually mean that, while one of these algorithm steps is occurring, that the amount of resources needed is changing but that the same kernel is in effect? If so, then without concurrent kernels, resources are idle that could, potentially, be used–even within a single task (i.e. if the problem is larger than can fit on the GPU at one time and, if it is ‘pipelined’, than different execution stages could overlap to use these available idle resources). The ratio of resources used with concurrent kernels to resources idle would be, approximately (theoretically–ignoring cache, etc.) the speed up.

This speed up, as given in the previous, pipelined, example, would also be true for large tasks–not just the ‘small kernels’ that have previously been discussed. It would be true for the tasks using the prefix sum (scan) based algorithms and operations that are split and combined (unless, somehow, the split and combination are somehow constant in element width?).

kappa · August 9, 2010, 2:59pm

I have not seen any discussion on these forums (maybe I missed it) about how the ‘shape’ of parallel algorithms affect occupancy, scheduling, and concurrent kernels. By ‘shape’ I mean, at the low ‘block’ level, using a balanced binary tree that we then sweep in two directions (up sweep and down sweep) and, after the higher ‘grid’ level, combining results from the different grid block elements to achieve the total result. It is not always the case, but it often is, that during the sweeps or merges, that the number of operations is increasing and decreasing as the number of elements (nodes of the tree or elements being merged) decreases or increases due to the inherent data structure hierarchies (i.e. balanced binary tree root versus leaves). My visualization of this is of parallel algorithm operations being more of a ‘triangle’ in shape than ‘square’–the number of (unmasked) operations is changing and not constant throughout the algorithm (step). (A squashed triangle is still a triangle–not a square or rectangle.)

Doesn’t this usually mean that, while one of these algorithm steps is occurring, that the amount of resources needed is changing but that the same kernel is in effect? If so, then without concurrent kernels, resources are idle that could, potentially, be used–even within a single task (i.e. if the problem is larger than can fit on the GPU at one time and, if it is ‘pipelined’, than different execution stages could overlap to use these available idle resources). The ratio of resources used with concurrent kernels to resources idle would be, approximately (theoretically–ignoring cache, etc.) the speed up.

This speed up, as given in the previous, pipelined, example, would also be true for large tasks–not just the ‘small kernels’ that have previously been discussed. It would be true for the tasks using the prefix sum (scan) based algorithms and operations that are split and combined (unless, somehow, the split and combination are somehow constant in element width?).

Topic		Replies	Views
concurrently running blocks from multiple kernels on the same SM related to Fermi and unified shader CUDA Programming and Performance	14	3877	November 30, 2010
Does a CUDA thread get assigned to a specific core from the start and until it finishes execution? CUDA Programming and Performance	12	4800	May 2, 2011
Can threads in a warp from different blocks? CUDA Programming and Performance	17	11847	March 26, 2010
CUDA 3.0: concurrent kernel launches CUDA Programming and Performance	9	17730	April 1, 2010
Kernel scheduling with Fermi independent blocks can be placed in new streams? CUDA Programming and Performance	14	13203	January 22, 2010
Fermi speculation Kernel invocation in kernel code CUDA Programming and Performance	10	4298	October 20, 2009
Concurrently kernels running on one device CUDA Programming and Performance	17	2738	March 2, 2010
Concurrent Kernel Execution on Fermi - confussion CUDA Programming and Performance	13	1657	October 10, 2011
GTC Keynote Thread CUDA Programming and Performance	31	19819	May 23, 2012
Unofficial Kepler Slides from Random Gamer Site Yeah, yeah, but we only have another week to rumor-m CUDA Programming and Performance	63	10343	April 5, 2012

Scheduling on Fermi

Related topics