Warp scheduling - have I got this right?

I can’t seem to find a complete explanation of exactly how kernels get scheduled and executed, so let me ask if I’ve got this correct.

  1. GPU selects a Block that hasn’t been executed, and assigns it to a single, available MP.
  2. A Block is only ever assigned to one MP (so that shared memory works)
  3. An MP is only assigned one block at a time. It will not be assigned another block until execution of all threads in that first block is complete. [?]

So block scheduling takes place at the MP level. The CUDA reference manual (diagram) implies that blocks are allocated round-robin style to MPs before execution, but also that blocks can execute in any order.

Now it gets hazy…

  1. (on my GT240) each of the 12 MPs has 8 cores. So the GPU assigns up to 8 of the block’s warps, one warp to a core in the block’s assigned MP.

  2. If there are fewer than 8 warps in the block, cores go unused [this doesn’t sound right]

  3. If the block contains more than 8 warps, the GPU will assign the remaining warps to cores either as previous warps complete, or when a running warp starts waiting for a resource (like I/O, or __syncthreads()).

  4. I’m not clear on whether warps can switch to alternative cores, or whether cores can have multiple warps assigned to them.

  5. Each warp executes its 32 (or fewer) threads in lock-step, one instruction every 4 clock cycles (I/O waits notwithstanding).

And now the fog descends…

  1. I don’t believe all 8 warps on the 8 cores execute independently in parallel, but I don’t have a clear picture of how they execute relative to the clock. Something about MPs only being able to complete one instruction per clock cycle.

How this varies with compute architectures is another question.

What score do I get for the above, or can someone point me towards the best online material covering this rather hardware-oriented perspective?

Charlie

…Actually, I’ve got (3) wrong, haven’t I? Several blocks can be assigned simultaneously to an MP. Which I guess resolves my concern on point (5), since warps from more than one block are available to be scheduled on cores.

Think of the 8 cores as an 8-wide vector unit that executes a warp in 4 clock cycles by splitting it into 4 segments.
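If it helps to picture that, here is a toy loop (plain host code, nothing the hardware actually runs — just my way of describing the model):

#include <stdio.h>

/* Toy model only: an 8-lane unit stepping through one instruction of a
   32-thread warp in 4 passes (= 4 clock cycles). */
int main(void)
{
    const int warpSize = 32, simdWidth = 8;

    for (int pass = 0; pass < warpSize / simdWidth; ++pass)
        printf("cycle %d: lanes execute threads %2d..%2d\n",
               pass, pass * simdWidth, (pass + 1) * simdWidth - 1);
    return 0;
}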

In fact you got one very important thing quite wrong.

Each warp is run across all 8 cores, and the MP does fast switching between warps when a warp has to wait for data, for example.

Ah. Yes, that was quite a misconception.

I will re-read the documentation with this model in mind; see if it makes better sense.

Thank you.

Charlie

I note that a GTX 480 has 15 MPs of 32 cores each.

Those 32 cores - do they still work in groups of 8 to execute a single warp, and so do they allow 4 warps to genuinely execute in parallel (semi-independently)?

No. The warp is executed on 32 cores. You can have more than 1000 active threads on an MP, but only 1 warp runs at one time. This is how the waiting time to fetch data from memory is hidden. In practice you just play around with how many threads (warps) per block to get the best performance.
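If you want to make that "playing around" systematic, something along these lines works: time the same kernel at a few block sizes and keep the fastest. myKernel and N below are placeholders rather than anything from this thread.

#include <cstdio>

// Placeholder kernel, just to have something to time.
__global__ void myKernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main()
{
    const int N = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, N * sizeof(float));

    for (int threads = 32; threads <= 512; threads *= 2) {
        int blocks = (N + threads - 1) / threads;

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);

        cudaEventRecord(t0);
        myKernel<<<blocks, threads>>>(d_x, N);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("%3d threads/block: %.3f ms\n", threads, ms);

        cudaEventDestroy(t0);
        cudaEventDestroy(t1);
    }

    cudaFree(d_x);
    return 0;
}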

Ok.

Now (I think) the GTX 680 with Kepler has 192 cores per SM (and they seem to be called SMXs — I assume these are the multiprocessors), and (only!) 8 SMXs?

I can appreciate 32 cores working together on a single warp of 32 threads, but I can’t see that applying to 192 cores.

Does the GTX 680 allow more than one warp to execute at a time on each SMX (err, 6?).

Charlie

I do not know much about the 6xx series, but my understanding is that the warp is still 32 threads, not 192. I do not know how the warps are executed, but I think we now have 6 warps executing at the same time per SMX. An SMX is not equivalent to 1 MP.

Actually, on compute capability 2.0 hardware, the warp is executed on only 16 of the 32 CUDA cores at a time. The MP is designed to issue instructions from 2 different warps at the same time, which is how all 32 cores are used. On compute capability 2.1, there are 48 cores per MP, and the scheduling is even a little more tricky. Three instructions from two warps can be issued at once, so to use all 48 cores, the scheduler has to find two instructions from the same warp that are independent of each other. This is not always possible, so compute capability 2.1 devices tend to underutilize their CUDA cores.

Kepler is even more complex, with 4 warp schedulers per SMX, each able to issue up to two instructions per clock from one warp to any one of 16 different pipelines. The “CUDA core” pipelines are still 16 units wide, like in compute capability 2.x. That gives 12 compute pipelines per SMX operating independently. There are also 2 load/store pipelines and 2 special function unit pipelines, which also operate independently.

In general, you should think of an SM(X) as a couple of functional blocks:

  • Local storage (registers, shared memory, L1 cache, block information, and a list of available warps). There is enough local storage to keep hundreds of threads or more in an active state (ready for dispatch).

  • A front-end of schedulers and dispatchers that select warps from those available, and then select the next instruction(s) from those warps and send them to the compute pipelines.

  • A set of instruction pipelines, whose “width” depends on the architecture. A “CUDA core” is really just one lane of a particular kind of pipeline that executes the vast majority of instructions, like arithmetic and branching. Execution of a single instruction takes many clock cycles (usually something like 20), but many instructions will be in the pipeline at the same time, generally from different warps. A CUDA core completes the instruction from one thread every clock cycle. CUDA cores are ganged into pipelines of width 8 or 16 depending on the architecture.

This design is what makes CUDA a “throughput-optimized” architecture.
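To put very rough numbers on that (my own back-of-envelope arithmetic): a pipeline that is 16 lanes wide and roughly 20 stages deep can have around 16 × 20 = 320 instructions in flight at once, which is why an SM needs hundreds of resident threads just to keep a single pipeline fed, never mind several of them.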

…Coming back to my quaint GT 240: compute capability 1.2, 12 MPs and 8 cores/MP…

My test-rig kernel, which records warp start and end times and the assigned %smid, implies that an MP can only have 8 distinct blocks assigned to it for scheduling at a time. Does this match expectations? (A sketch of the kind of kernel I mean is further down.)

E.g. if I give the GPU a kernel<<<97,32>>>, MP 0 maps out as (other MP results ignored)

Block  Warp  Start   End  SMID  Duration
    0     0     66  1818     0      1752
   24     0     70  1744     0      1674
   48     0     74  1752     0      1678
   36     0     80  1748     0      1668
   60     0     84  1756     0      1672
   12     0     88  1762     0      1674
   72     0     92  1802     0      1710
   84     0    128  1814     0      1686
   96     0   4062  4232     0       170

That final block #96 clearly only starts after the previous 8 have completed (start delay presumably caused by a memory bottleneck getting the previous thread results out to global memory). It also completes very fast without any other warps to consume time slices or memory bandwidth.

But if I give it kernel<<<49,128>>>, MP 0 maps out as

Block  Warp  Start   End  SMID  Duration
    0     0     60  2150     0      2090
    0     2     64  2062     0      1998
   12     0     68  2094     0      2026
    0     3     74  2218     0      2144
   12     1     78  2136     0      2058
    0     1     82  2160     0      2078
   24     1     86  2074     0      1988
   24     2     90  2114     0      2024
   24     3     94  2110     0      2016
   36     0     98  2122     0      2024
   36     1    102  2118     0      2016
   36     2    106  2130     0      2024
   36     3    110  2222     0      2112
   48     0    114  2126     0      2012
   48     1    118  2140     0      2022
   48     2    122  2144     0      2022
   48     3    126  2318     0      2192
   12     3    132  2266     0      2134
   12     2    182  2098     0      1916
   24     0    230  2182     0      1952

Which is only 5 distinct blocks, but 20 warps. Clearly time sliced.
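For reference, this is roughly the kind of kernel my test rig uses (a minimal sketch — my real code differs in details, and the spin loop is just a stand-in for real work):

// Read the multiprocessor ID this thread is running on.
__device__ unsigned int smid()
{
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// One record per warp: start clock, end clock and SM ID, written by lane 0.
__global__ void warpTiming(unsigned int *start, unsigned int *end,
                           unsigned int *sm, int spin)
{
    int warpsPerBlock = (blockDim.x + warpSize - 1) / warpSize;
    int rec = blockIdx.x * warpsPerBlock + threadIdx.x / warpSize;

    unsigned int t0 = (unsigned int)clock();

    // Dummy work so the warp stays resident for a while.
    volatile float x = threadIdx.x;
    for (int i = 0; i < spin; ++i)
        x = x * 0.999f + 1.0f;

    unsigned int t1 = (unsigned int)clock();

    if (threadIdx.x % warpSize == 0) {
        start[rec] = t0;
        end[rec]   = t1;
        sm[rec]    = smid();
    }
}

// Launched e.g. as warpTiming<<<97, 32>>>(d_start, d_end, d_sm, 10000);
// the per-warp records are then copied back and sorted by start time.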

[All noted as highly academic, since for real problems I’d be giving it plenty of blocks and warps, and tuning by experience]

Charlie

[Edit: Sorry, those tables of data don’t display well in the post]

The CUDA Programming Guide, Appendix F, lists the block and warp limitations of the different compute capabilities. In your case, the guide states that compute capability 1.2 is limited to 8 blocks, 32 warps, and 16384 registers per SM.

So your first case makes sense, due to the 8 block limit, and your second case also makes sense if the number of blocks is limited by the number of registers (or shared memory, if you use that).
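If you want to check which resource is doing the limiting, one option (a sketch, not something from the guide) is to ask the runtime for the compiled kernel’s resource usage and compare it against the Appendix F limits; myKernel below is just a placeholder for your own kernel:

#include <cstdio>

__global__ void myKernel(float *x)
{
    x[threadIdx.x] *= 2.0f;   // placeholder body
}

int main()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, myKernel);

    // Compare these against the per-SM limits for your compute capability
    // (e.g. 16384 registers per SM on compute capability 1.2).
    printf("registers per thread : %d\n", attr.numRegs);
    printf("static shared memory : %zu bytes\n", attr.sharedSizeBytes);
    printf("max threads per block: %d\n", attr.maxThreadsPerBlock);
    return 0;
}

Compiling with nvcc --ptxas-options=-v prints the same sort of information at build time.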

Hi Charlie,

A key thing to remember is that it is Single Instruction Multiple Data (‘SIMD’): all 32 threads in a warp must execute the same instruction, and they do it in parallel. The design intent is that they do it in one clock cycle, but on older devices, e.g. yours with only 8 cores per SM, it is done 8 threads at a time and takes 4 clock cycles to process the whole warp. Then and only then comes the opportunity to start the next instruction (which may be from a different warp).

As techniques for making computer/GPU chips have improved, the number of transistors that can be made on a single chip has increased, so the number of cores NVIDIA has been able to put on a chip has increased.

So, to issue one instruction for a warp:
On your GT240 it takes 4 clock cycles per warp.
Later GPUs take 1 clock cycle (i.e. 32 cores doing the 32 threads simultaneously).
The newest GPUs have 192 cores per SM and can process several different warps in the same clock cycle. If you want more details of the newer chips, have a look at the NVIDIA-Kepler-GK110-Architecture-Whitepaper.

Regardless of which GPU you are using, all threads in a warp still execute the same instruction together, and your algorithm will probably be the same.
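A small illustration of that lock-step behaviour (just a throwaway kernel, not from any real code): a condition that splits a warp makes the hardware run both paths one after the other, whereas a condition that is uniform across a warp costs only one path.

__global__ void divergenceDemo(float *out)
{
    int i = threadIdx.x;

    if (i % 2 == 0)                 // splits every warp: the two paths run serially
        out[i] = i * 2.0f;
    else
        out[i] = i * 3.0f;

    if ((i / warpSize) % 2 == 0)    // uniform within each warp: no divergence
        out[i] += 1.0f;
}

// e.g. divergenceDemo<<<1, 128>>>(d_out);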

Cheers

Thanks. These explanations are gold.

I will stop digging in a minute, but …

The CUDA C Programming Guide has a section on “Maximize Instruction Throughput” containing a table entitled “Table 2. Throughput of Native Arithmetic Instructions (Operations per Clock Cycle per Multiprocessor)”.

Now for CC 1.0, 1.1 and 1.2, it states 8 for most items (I assume 1 per core), and lower figures for some other instructions – fine. But then it states 32-bit integer add = 10 ops per clock, per SM.

a) Where are those 2 extra ops per clock cycle coming from?
b) Will I regret asking this?

Charlie

Huh, I never noticed that 10 in the table. I have no idea where that comes from, and since the size of a warp is not divisible by 10, the actual throughput can’t be that simple. I wonder if this is related to dual issue in the old compute capability 1.x devices, where the special function unit could sometimes perform basic arithmetic along with the regular CUDA cores. However, I’m just speculating here…

That whole row is full of mystery. For the CC 3.0 and 3.5 columns, where 32-bit floating-point operations scale up from 8 to 192 (in line with the cores), the 32-bit integer add only rises to 160.

Why have the cores got it in for integer operations?

Yeah, I did notice the drop in integer throughput on compute capability 3.x when they came out. Kepler seems to have been rebalanced to dramatically improve power efficiency (which was pretty successful), and NVIDIA decided to shave a little die area by dropping some of the integer instruction throughput.

I suspect that the 160 number comes from a decision to only support integer instructions on 10 of the 12 general-purpose pipelines (each consisting of 16 CUDA cores, giving 10 × 16 = 160). I’m not sure how much that saves, but apparently it was worth it.