Warp scheduling - have I got this right?

I can’t seem to find a complete explanation of exactly how kernels get scheduled and executed, so let me ask if I’ve got this correct.

  1. GPU selects a Block that hasn’t been executed, and assigns it to a single, available MP.
  2. A Block is only ever assigned to one MP (so that shared memory works)
  3. An MP is only assigned one block at a time. It will not be assigned another block until execution of all threads in that first block is complete. [?]

So block scheduling takes place at the MP level. The CUDA reference manual (diagram) implies that blocks are allocated round-robin style to MPs before execution, but also that blocks can execute in any order.

Now it gets hazy…

  1. (on my GT240) each of the 12 MPs has 8 cores. So the GPU assigns up to 8 of the block’s warps, one warp to a core in the block’s assigned MP.

  2. If there are fewer than 8 warps in the block, cores go unused [this doesn’t sound right]

  3. If the block contains more than 8 warps, the GPU will assign the remaining warps to cores either as previous warps complete, or when a running warp starts waiting for a resource (like I/O, or __syncthreads()).

  4. I’m not clear on whether warps can switch to alternative cores, or whether cores can have multiple warps assigned to them.

  5. Each warp executes its 32 (or fewer) threads in lock-step, one instruction every 4 clock cycles (I/O waits notwithstanding).

And now the fog descends…

  1. I don’t believe all 8 warps on the 8 cores execute independently in parallel, but I don’t have a clear picture of how they execute relative to the clock. Something about MPs only being able to complete one instruction per clock cycle.

How this varies with compute architectures is another question.

What score do I get for the above, or can someone point me towards the best online material covering this rather hardware-oriented perspective?

Charlie

…Actually, I’ve got (3) wrong, haven’t I? Several blocks can be assigned simultaneously to an MP. Which I guess resolves my concern on point (5), since warps from more than one block are available to be scheduled on cores.

Think of the 8 cores as an 8-wide vector unit that executes a warp in 4 clock cycles by splitting it into 4 segments.
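If it helps to picture that, here is a toy loop (plain host code, nothing the hardware actually runs — just my way of describing the model):

#include <stdio.h>

/* Toy model only: an 8-lane unit stepping through one instruction of a
   32-thread warp in 4 passes (= 4 clock cycles). */
int main(void)
{
    const int warpSize = 32, simdWidth = 8;

    for (int pass = 0; pass < warpSize / simdWidth; ++pass)
        printf("cycle %d: lanes execute threads %2d..%2d\n",
               pass, pass * simdWidth, (pass + 1) * simdWidth - 1);
    return 0;
}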

In fact you got one very important thing quite wrong.

Each warp is run across all 8 cores, and the MP does fast switching between warps when a warp has to wait for data, for example.

Ah. Yes, that was quite a misconception.

I will re-read the documentation with this model in mind; see if it makes better sense.

Thank you.

Charlie

I note that a GTX 480 has 15 MPs of 32 cores each.

Those 32 cores - do they still work in groups of 8 to execute a single warp, and so do they allow 4 warps to genuinely execute in parallel (semi-independently)?

No. The warp is executed on 32 cores. You can have more than 1000 active threads on an MP, but only 1 warp runs at one time. This is how the waiting time to fetch data from memory is hidden. In practice you just play around with how many threads (warps) per block to get the best performance.
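If you want to make that "playing around" systematic, something along these lines works: time the same kernel at a few block sizes and keep the fastest. myKernel and N below are placeholders rather than anything from this thread.

#include <cstdio>

// Placeholder kernel, just to have something to time.
__global__ void myKernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main()
{
    const int N = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, N * sizeof(float));

    for (int threads = 32; threads <= 512; threads *= 2) {
        int blocks = (N + threads - 1) / threads;

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);

        cudaEventRecord(t0);
        myKernel<<<blocks, threads>>>(d_x, N);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("%3d threads/block: %.3f ms\n", threads, ms);

        cudaEventDestroy(t0);
        cudaEventDestroy(t1);
    }

    cudaFree(d_x);
    return 0;
}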

Ok.

Now (I think) the GTX 680 with Kepler has 192 cores per SM (and they seem to be called SMXs — I assume these are the multiprocessors), and (only!) 8 SMXs?

I can appreciate 32 cores working together on a single warp of 32 threads, but I can’t see that applying to 192 cores.

Does the GTX 680 allow more than one warp to execute at a time on each SMX (err, 6?).

Charlie

I do not know much about the 6xx series, but my understanding is that the warp is still 32 threads, not 192. I do not know how the warps are executed, but I think we now have 6 warps executing at the same time per SMX. An SMX is not equivalent to 1 MP.

Actually, on compute capability 2.0 hardware, the warp is executed on only 16 of the 32 CUDA cores at a time. The MP is designed to issue instructions from 2 different warps at the same time, which is how all 32 cores are used. On compute capability 2.1, there are 48 cores per MP, and the scheduling is even a little more tricky. Three instructions from two warps can be issued at once, so to use all 48 cores, the scheduler has to find two instructions from the same warp that are independent of each other. This is not always possible, so compute capability 2.1 devices tend to underutilize their CUDA cores.

Kepler is even more complex, with 4 warp schedulers per SMX, each able to issue up to two instructions per clock from one warp to any one of 16 different pipelines. The “CUDA core” pipelines are still 16 units wide, like in compute capability 2.x. That gives 12 compute pipelines per SMX operating independently. There are also 2 load/store pipelines and 2 special function unit pipelines, which also operate independently.

In general, you should think of an SM(X) as a couple of functional blocks:

  • Local storage (registers, shared memory, L1 cache, block information, and a list of available warps). There is enough local storage to keep hundreds of threads or more in an active state (ready for dispatch).

  • A front-end of schedulers and dispatchers that select warps from those available, and then select the next instruction(s) from those warps and send them to the compute pipelines.

  • A set of instruction pipelines, whose “width” depends on the architecture. A “CUDA core” is really just one lane of a particular kind of pipeline that executes the vast majority of instructions, like arithmetic and branching. Execution of a single instruction takes many clock cycles (usually something like 20), but many instructions will be in the pipeline at the same time, generally from different warps. A CUDA core completes the instruction from one thread every clock cycle. CUDA cores are ganged into pipelines of width 8 or 16 depending on the architecture.

This design is what makes CUDA a “throughput-optimized” architecture.
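To put very rough numbers on that (my own back-of-envelope arithmetic): a pipeline that is 16 lanes wide and roughly 20 stages deep can have around 16 × 20 = 320 instructions in flight at once, which is why an SM needs hundreds of resident threads just to keep a single pipeline fed, never mind several of them.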

…Coming back to my quaint GT 240: compute capability 1.2, 12 MPs and 8 cores/MP…

My test-rig kernel, which records warp start and end times and the assigned %smid, implies that an MP can only have 8 distinct blocks assigned to it for scheduling at a time. Does this match expectations? (A sketch of the kind of kernel I mean is further down.)

E.g. if I give the GPU a kernel<<<97,32>>>, MP 0 maps out as (other MP results ignored)

Block  Warp  Start   End  SMID  Duration
    0     0     66  1818     0      1752
   24     0     70  1744     0      1674
   48     0     74  1752     0      1678
   36     0     80  1748     0      1668
   60     0     84  1756     0      1672
   12     0     88  1762     0      1674
   72     0     92  1802     0      1710
   84     0    128  1814     0      1686
   96     0   4062  4232     0       170

That final block #96 clearly only starts after the previous 8 have completed (start delay presumably caused by a memory bottleneck getting the previous thread results out to global memory). It also completes very fast without any other warps to consume time slices or memory bandwidth.

But if I give it kernel<<<49,128>>>, MP 0 maps out as

Block  Warp  Start   End  SMID  Duration
    0     0     60  2150     0      2090
    0     2     64  2062     0      1998
   12     0     68  2094     0      2026
    0     3     74  2218     0      2144
   12     1     78  2136     0      2058
    0     1     82  2160     0      2078
   24     1     86  2074     0      1988
   24     2     90  2114     0      2024
   24     3     94  2110     0      2016
   36     0     98  2122     0      2024
   36     1    102  2118     0      2016
   36     2    106  2130     0      2024
   36     3    110  2222     0      2112
   48     0    114  2126     0      2012
   48     1    118  2140     0      2022
   48     2    122  2144     0      2022
   48     3    126  2318     0      2192
   12     3    132  2266     0      2134
   12     2    182  2098     0      1916
   24     0    230  2182     0      1952

Which is only 5 distinct blocks, but 20 warps. Clearly time sliced.
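For reference, this is roughly the kind of kernel my test rig uses (a minimal sketch — my real code differs in details, and the spin loop is just a stand-in for real work):

// Read the multiprocessor ID this thread is running on.
__device__ unsigned int smid()
{
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// One record per warp: start clock, end clock and SM ID, written by lane 0.
__global__ void warpTiming(unsigned int *start, unsigned int *end,
                           unsigned int *sm, int spin)
{
    int warpsPerBlock = (blockDim.x + warpSize - 1) / warpSize;
    int rec = blockIdx.x * warpsPerBlock + threadIdx.x / warpSize;

    unsigned int t0 = (unsigned int)clock();

    // Dummy work so the warp stays resident for a while.
    volatile float x = threadIdx.x;
    for (int i = 0; i < spin; ++i)
        x = x * 0.999f + 1.0f;

    unsigned int t1 = (unsigned int)clock();

    if (threadIdx.x % warpSize == 0) {
        start[rec] = t0;
        end[rec]   = t1;
        sm[rec]    = smid();
    }
}

// Launched e.g. as warpTiming<<<97, 32>>>(d_start, d_end, d_sm, 10000);
// the per-warp records are then copied back and sorted by start time.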

[All noted as highly academic, since for real problems I’d be giving it plenty of blocks and warps, and tuning by experience]

Charlie

[Edit: Sorry, those tables of data don’t display well in the post]

The CUDA Programming Guide, Appendix F, lists the block and warp limitations of the different compute capabilities. In your case, the guide states that compute capability 1.2 is limited to 8 blocks, 32 warps, and 16384 registers per SM.

So your first case makes sense, due to the 8 block limit, and your second case also makes sense if the number of blocks is limited by the number of registers (or shared memory, if you use that).
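If you want to check which resource is doing the limiting, one option (a sketch, not something from the guide) is to ask the runtime for the compiled kernel’s resource usage and compare it against the Appendix F limits; myKernel below is just a placeholder for your own kernel:

#include <cstdio>

__global__ void myKernel(float *x)
{
    x[threadIdx.x] *= 2.0f;   // placeholder body
}

int main()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, myKernel);

    // Compare these against the per-SM limits for your compute capability
    // (e.g. 16384 registers per SM on compute capability 1.2).
    printf("registers per thread : %d\n", attr.numRegs);
    printf("static shared memory : %zu bytes\n", attr.sharedSizeBytes);
    printf("max threads per block: %d\n", attr.maxThreadsPerBlock);
    return 0;
}

Compiling with nvcc --ptxas-options=-v prints the same sort of information at build time.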

Hi Charlie,

A key thing to remember is that it is Single Instruction Multiple Data (‘SIMD’): all 32 threads in a warp must execute the same instruction, and they do it in parallel. The design intent is that they do it in one clock cycle, but on older devices, e.g. yours with only 8 cores per SM, it is done 8 threads at a time and takes 4 clock cycles to process the whole warp. Then and only then comes the opportunity to start the next instruction (which may be from a different warp).

As techniques for making computer/GPU chips have improved, the number of transistors that can be made on a single chip has increased, so the number of cores NVIDIA has been able to put on a chip has increased.

So, to issue one instruction for a warp:
On your GT240 it takes 4 clock cycles per warp.
Later GPUs take 1 clock cycle (i.e. 32 cores doing the 32 threads simultaneously).
The newest GPUs have 192 cores per SM and can process several different warps in the same clock cycle. If you want more details of the newer chips, have a look at the NVIDIA-Kepler-GK110-Architecture-Whitepaper.

Regardless of which GPU you are using, all threads in a warp still execute the same instruction together, and your algorithm will probably be the same.
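A small illustration of that lock-step behaviour (just a throwaway kernel, not from any real code): a condition that splits a warp makes the hardware run both paths one after the other, whereas a condition that is uniform across a warp costs only one path.

__global__ void divergenceDemo(float *out)
{
    int i = threadIdx.x;

    if (i % 2 == 0)                 // splits every warp: the two paths run serially
        out[i] = i * 2.0f;
    else
        out[i] = i * 3.0f;

    if ((i / warpSize) % 2 == 0)    // uniform within each warp: no divergence
        out[i] += 1.0f;
}

// e.g. divergenceDemo<<<1, 128>>>(d_out);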

Cheers

Thanks. These explanations are gold.

I will stop digging in a minute, but …

The CUDA C Programming Guide has a section on “Maximize Instruction Throughput” containing a table entitled “Table 2. Throughput of Native Arithmetic Instructions (Operations per Clock Cycle per Multiprocessor)”.

Now for CC 1.0, 1.1 and 1.2, it states 8 for most items (I assume 1 per core), and lower figures for some other instructions – fine. But then it states 32-bit integer add = 10 ops per clock, per SM.

a) Where are those 2 extra ops per clock cycle coming from?
b) Will I regret asking this?

Charlie

Huh, I never noticed that 10 in the table. I have no idea where that comes from, and since the size of a warp is not divisible by 10, the actual throughput can’t be that simple. I wonder if this is related to dual issue in the old compute capability 1.x devices, where the special function unit could sometimes perform basic arithmetic along with the regular CUDA cores. However, I’m just speculating here…

That whole row is full of mystery. For the CC 3.0 and 3.5 columns, where 32-bit floating-point operations scale up from 8 to 192 (in line with the cores), the 32-bit integer add only rises to 160.

Why have the cores got it in for integer operations?

Yeah, I did notice the drop in integer throughput on compute capability 3.x when they came out. Kepler seems to have been rebalanced to dramatically improve power efficiency (which was pretty successful), and NVIDIA decided to shave a little die area by dropping some of the integer instruction throughput.

I suspect that the 160 number comes from a decision to only support integer instructions on 10 of the 12 general-purpose pipelines (each consisting of 16 CUDA cores, giving 10 × 16 = 160). I’m not sure how much that saves, but apparently it was worth it.