Scheduling on Fermi

According to Fermi documents, SM_20 and higher CUDA devices should schedule threads at a warp level, rather than a thread block level as in GT-200 and previous models. Is there any further details about how this scheduler work? I think previous posts on this board about scheduling resulted in guys from NVIDIA telling me that Fermi has more ``flexible ‘’ scheduling, possibly implying better performance, maybe. But consider conventional platforms allow you to have certain affinity-control mechanisms, I wonder whether there is unexploited API features available for this. Or anybody has carried out certain micorbenchmarks on this matter. Thanks.

Oh, I didn’t catch this in the documentation. Could you point to the paper that cites this?

Oh, I didn’t catch this in the documentation. Could you point to the paper that cites this?

The original scheduler had some issues that were pointed out by developers… For instance, if some of the blocks exit in the middle of the computation, the schduler will not dynamically replace them… I hope this was fixed in FERMI and probably that is what they refer as “WARP” based scheduling…

Juss my guess though…

The original scheduler had some issues that were pointed out by developers… For instance, if some of the blocks exit in the middle of the computation, the schduler will not dynamically replace them… I hope this was fixed in FERMI and probably that is what they refer as “WARP” based scheduling…

Juss my guess though…

Sorry I think I’ve made a mistake here. I can recall that I’ve come across one preview slide about GF-100 architecture last year. It’s highly unlikely that scheduling is on Warp level, otherwise occupancy and many other problems would not exist. I can remotely recall that the slide commented the difference of scheduling difference between GT200 and GF100 and referred the GF100 schedules threads at a finer level, but I can’t find the slide now. Maybe the slide is comparing a TPC in GT200 against a SM in GF100, since an SM of GF100 now has more SP’s than a TPC of GT200. A warp in a TB can be executing on only 1 SM in an TPC on GT200, but now it can be executing on potentially more SP’s on GF100 (since more SP’s are available in one SM on GF100 than one TPC in GT200).

I still wonder how more flexible scheduling is possible on GF100, and how it is done. For example, when occupancy is 100%, with 256 threads per TB, and 6 TB’s are assigned to one SM, when one TB finishes, another TB (beyond those 6 TB’s, I mean) would start executing right away? or it has to wait for the other 5 TB’s to finish too.

Sorry I think I’ve made a mistake here. I can recall that I’ve come across one preview slide about GF-100 architecture last year. It’s highly unlikely that scheduling is on Warp level, otherwise occupancy and many other problems would not exist. I can remotely recall that the slide commented the difference of scheduling difference between GT200 and GF100 and referred the GF100 schedules threads at a finer level, but I can’t find the slide now. Maybe the slide is comparing a TPC in GT200 against a SM in GF100, since an SM of GF100 now has more SP’s than a TPC of GT200. A warp in a TB can be executing on only 1 SM in an TPC on GT200, but now it can be executing on potentially more SP’s on GF100 (since more SP’s are available in one SM on GF100 than one TPC in GT200).

I still wonder how more flexible scheduling is possible on GF100, and how it is done. For example, when occupancy is 100%, with 256 threads per TB, and 6 TB’s are assigned to one SM, when one TB finishes, another TB (beyond those 6 TB’s, I mean) would start executing right away? or it has to wait for the other 5 TB’s to finish too.

In fact, I also heard a similar story. The idea was that scheduler on GF100 can execute both pathways of a divergent branch concurrently, rather than serially.

However, I don’t know if this is true, or an academic speculation/proposal.

In fact, I also heard a similar story. The idea was that scheduler on GF100 can execute both pathways of a divergent branch concurrently, rather than serially.

However, I don’t know if this is true, or an academic speculation/proposal.

Sorry I didn’t quite get for which this is related to scheduling. For predication I can imagine that now GF-104 with CC-2.1 can dispatch more instructions and they use certain ILP exploitation techniques. Maybe fr predication they execute both paths and keep the right results. But this at least require more complex Write-Back mechanisms, since no register renaming is present. This might reduce latency especially that there is RAW (ReadAfterWrite) hazard on the predicate variable. This won’t be of much good in a massively parallel scenario such as GPU. Anyway, just some speculation on this topic…

Sorry I didn’t quite get for which this is related to scheduling. For predication I can imagine that now GF-104 with CC-2.1 can dispatch more instructions and they use certain ILP exploitation techniques. Maybe fr predication they execute both paths and keep the right results. But this at least require more complex Write-Back mechanisms, since no register renaming is present. This might reduce latency especially that there is RAW (ReadAfterWrite) hazard on the predicate variable. This won’t be of much good in a massively parallel scenario such as GPU. Anyway, just some speculation on this topic…

May be we are talking about different scheduling. I am referring to the 2 warp schedulers per SM that chose the next warp to issue an instruction from. I think you are talking about scheduling on top of that - choosing the next thread block or warp to run when other thread blocks retire.

May be we are talking about different scheduling. I am referring to the 2 warp schedulers per SM that chose the next warp to issue an instruction from. I think you are talking about scheduling on top of that - choosing the next thread block or warp to run when other thread blocks retire.

Regarding the block-level scheduling, Tim Murray posted a few insights into how the system handles concurrent kernel launches: http://forums.nvidia.com/index.php?s=&…t&p=1031594

If there is even no partitioning of SMs for different kernels, then it seems clear that individual blocks certainly are dynamically replaced immediately when they finish executing - possibly even from a different kernel!

Regarding the block-level scheduling, Tim Murray posted a few insights into how the system handles concurrent kernel launches: http://forums.nvidia.com/index.php?s=&…t&p=1031594

If there is even no partitioning of SMs for different kernels, then it seems clear that individual blocks certainly are dynamically replaced immediately when they finish executing - possibly even from a different kernel!

I have not seen any discussion on these forums (maybe I missed it) about how the ‘shape’ of parallel algorithms affect occupancy, scheduling, and concurrent kernels. By ‘shape’ I mean, at the low ‘block’ level, using a balanced binary tree that we then sweep in two directions (up sweep and down sweep) and, after the higher ‘grid’ level, combining results from the different grid block elements to achieve the total result. It is not always the case, but it often is, that during the sweeps or merges, that the number of operations is increasing and decreasing as the number of elements (nodes of the tree or elements being merged) decreases or increases due to the inherent data structure hierarchies (i.e. balanced binary tree root versus leaves). My visualization of this is of parallel algorithm operations being more of a ‘triangle’ in shape than ‘square’–the number of (unmasked) operations is changing and not constant throughout the algorithm (step). (A squashed triangle is still a triangle–not a square or rectangle.)

Doesn’t this usually mean that, while one of these algorithm steps is occurring, that the amount of resources needed is changing but that the same kernel is in effect? If so, then without concurrent kernels, resources are idle that could, potentially, be used–even within a single task (i.e. if the problem is larger than can fit on the GPU at one time and, if it is ‘pipelined’, than different execution stages could overlap to use these available idle resources). The ratio of resources used with concurrent kernels to resources idle would be, approximately (theoretically–ignoring cache, etc.) the speed up.

This speed up, as given in the previous, pipelined, example, would also be true for large tasks–not just the ‘small kernels’ that have previously been discussed. It would be true for the tasks using the prefix sum (scan) based algorithms and operations that are split and combined (unless, somehow, the split and combination are somehow constant in element width?).

I have not seen any discussion on these forums (maybe I missed it) about how the ‘shape’ of parallel algorithms affect occupancy, scheduling, and concurrent kernels. By ‘shape’ I mean, at the low ‘block’ level, using a balanced binary tree that we then sweep in two directions (up sweep and down sweep) and, after the higher ‘grid’ level, combining results from the different grid block elements to achieve the total result. It is not always the case, but it often is, that during the sweeps or merges, that the number of operations is increasing and decreasing as the number of elements (nodes of the tree or elements being merged) decreases or increases due to the inherent data structure hierarchies (i.e. balanced binary tree root versus leaves). My visualization of this is of parallel algorithm operations being more of a ‘triangle’ in shape than ‘square’–the number of (unmasked) operations is changing and not constant throughout the algorithm (step). (A squashed triangle is still a triangle–not a square or rectangle.)

Doesn’t this usually mean that, while one of these algorithm steps is occurring, that the amount of resources needed is changing but that the same kernel is in effect? If so, then without concurrent kernels, resources are idle that could, potentially, be used–even within a single task (i.e. if the problem is larger than can fit on the GPU at one time and, if it is ‘pipelined’, than different execution stages could overlap to use these available idle resources). The ratio of resources used with concurrent kernels to resources idle would be, approximately (theoretically–ignoring cache, etc.) the speed up.

This speed up, as given in the previous, pipelined, example, would also be true for large tasks–not just the ‘small kernels’ that have previously been discussed. It would be true for the tasks using the prefix sum (scan) based algorithms and operations that are split and combined (unless, somehow, the split and combination are somehow constant in element width?).