Understanding CUDA scheduling

Hello forum,

There are some aspects of CUDA scheduling that I have not understood.

Suppose we are on a Maxwell GPU such as the GeForce GTX 750 (I would also be interested in an answer about the Keplers).

Firstly, according to this image:
http://cdn2.ubergizmo.com/wp-content/uploads/2014/02/nvidia-kepler-vs-maxwell-sm.gif
and this post by an NVIDIA chief technologist:
https://devblogs.nvidia.com/parallelforall/5-things-you-should-know-about-new-maxwell-gpu-architecture/
“Each warp scheduler still has the flexibility to dual-issue”

the Maxwell schedulers should do dual-issue.

However, according to this:
Programming Guide :: CUDA Toolkit Documentation
“Then, at every instruction issue time, each scheduler issues one instruction for one of its assigned warps that is ready to execute, if any.”
the schedulers do single-issue only (note that the same text for compute capability 3.x says that they dual-issue there).

Question_1) So this is already a problem: which one is right, single-issue or dual-issue?

Now some more questions…

Suppose I launch a large number of kernels, each with only 1 thread (I really mean 1 thread, not just 1 warp), and suppose they do independent 32-bit floating-point additions (each 1 clock cycle long).
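For concreteness, the launch pattern I have in mind is something like this (only a sketch; the kernel body, the names, and the use of one stream per kernel are my own illustration):

[code]
#include <cuda_runtime.h>

__global__ void addOne(float *x)
{
    // Exactly one thread per kernel, as in my hypothesis.
    *x = *x + 1.0f;
}

int main()
{
    const int NUM_KERNELS = 16;   // "large number", kept small here
    float *d_data;
    cudaMalloc(&d_data, NUM_KERNELS * sizeof(float));
    cudaMemset(d_data, 0, NUM_KERNELS * sizeof(float));

    cudaStream_t streams[NUM_KERNELS];
    for (int i = 0; i < NUM_KERNELS; ++i) {
        cudaStreamCreate(&streams[i]);
        // <<<1, 1>>>: one block of one thread; separate streams give
        // the kernels a chance to run concurrently.
        addOne<<<1, 1, 0, streams[i]>>>(d_data + i);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < NUM_KERNELS; ++i)
        cudaStreamDestroy(streams[i]);
    cudaFree(d_data);
    return 0;
}
[/code]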
Note that the bottleneck in this case would be the scheduler, so:

Question_2) How many cores would be occupied computing in each SMM functional unit?

I am guessing probably 2 cores for each SMM functional unit, because:
one new thread would start execution on each clock cycle in each SMM functional unit, since there is only one scheduler and it issues two instructions to the same warp, which has only 1 thread by my hypothesis (is the second instruction “remembered”? Or is it executed immediately, simultaneously with the first, by splitting the thread in two? I will suppose it is “remembered” and executed after the first).

So such a thread would execute two FP operations (dual-issue, correct?) in sequence, without needing interaction with the scheduler again between the first and the second FP operation. Is this correct?

So at every clock cycle a new thread enters execution, and it stays in execution for 2 clock cycles; on average there would be 2 cores executing, so the answer to Question_2 would be 2. But this assumes that the scheduler is capable of having two kernels running simultaneously, if core resources are sufficient (also see the next question). Is it capable of this?

For the next question, suppose one SMM functional unit has two kernels running on it, each having only 1 warp with just 16 threads, all of which need to do independent 32-bit FP operations (each 1 clock cycle long). Let’s suppose the scheduler really can do dual-issue:

Question_3) Is the scheduler capable of keeping the two kernels running simultaneously, as long as the cores required by both do not exceed the cores available on the SMM functional unit (= 32)? (This is similar to Question_2, actually.)

Now suppose the two kernels of the previous question each have 1 warp of 32 threads, but 16 of those threads are conditionally disabled, so effectively only 16 need to compute, as in the previous example.
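For example, a hypothetical kernel like this (my own sketch, names are mine):

[code]
#include <cuda_runtime.h>

__global__ void halfDisabled(float *x)
{
    // One full warp of 32 threads, but only lanes 0-15 do the FP32
    // addition; lanes 16-31 are conditionally disabled by the branch.
    if (threadIdx.x < 16)
        x[threadIdx.x] = x[threadIdx.x] + 1.0f;
}

int main()
{
    float *d_x;
    cudaMalloc(&d_x, 32 * sizeof(float));
    cudaMemset(d_x, 0, 32 * sizeof(float));
    halfDisabled<<<1, 32>>>(d_x);   // 1 block, 1 full warp
    cudaDeviceSynchronize();
    cudaFree(d_x);
    return 0;
}
[/code]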

Question_4) Does the answer to Question_3 change now?

Thanks for any enlightenment
Ugo

Hi Ugo,

here are my thoughts on your questions:

1) I would say the programming guide is a bit imprecise here, since the whitepaper ([url]http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce-GTX-750-Ti-Whitepaper.pdf[/url]) also shows 2 dispatch units per warp scheduler. In Kepler, single issue was not enough to fully utilize the SMX cores. In Maxwell we have 32 cores per scheduler, i.e. one warp.
“Each warp scheduler still has the flexibility to dual-issue (such as issuing a math operation to a CUDA Core in the same cycle as a memory operation to a load/store unit), but single-issue is now sufficient to fully utilize all CUDA Cores” ([url]https://devblogs.nvidia.com/parallelforall/5-things-you-should-know-about-new-maxwell-gpu-architecture/[/url])

2)-3) I do not fully understand what you mean by “kernels”. If I understand your question correctly, you are asking what happens if you run a kernel with N blocks where each block has 1 thread or 16 threads. I am pretty sure that threads from different blocks are never combined into one warp (but unfortunately I can’t find a quote).
In your case 2) that means: 4 warps are scheduled per SMM (4 schedulers) with one thread each, so in total 4 cores are utilized per SMM. In case 3) it’s 4×16 cores.
4) It’s the same: each scheduler issues up to one instruction to a warp (or, according to answer 1, possibly an additional memory load/store).

Here is the quote: [url]http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture[/url]

Thanks for your reply hadschi118,

2)-3)
“kernel” is very well defined in CUDA :-). Yes, I think you understood the question correctly.

“I am pretty sure, that threads from different blocks are never combined into one warp”

Well, right, I am also pretty sure it’s not possible to combine threads to create a new warp, but I was not so sure that it is impossible to execute more than one warp simultaneously. Please note that in Maxwell the SMX has been divided into four SMM functional units, and one FU has exactly one scheduler. So you are saying that in an FU there can never be more than one warp executing at a given moment?

Also, if your answer is correct, can you tell me what dual-issue means? How would your answer differ between the case where Maxwell can do dual-issue and the case where it cannot?

Also, is dual-issue the ability to run two instructions for the same warp in the same clock cycle, or to run two instructions for the same warp in consecutive clock cycles (without further scheduler intervention)? And if the latter is true, why can there not be two warps executing simultaneously (the older one executing the second instruction of the previous dual-issue, the newer one executing the first instruction of the current dual-issue)?

Thank you

QUESTION 1

Kepler and Maxwell SMs both have 4 warp schedulers. On each cycle, each warp scheduler picks an eligible warp (i.e. a warp that is not stalled) and issues 1 or 2 independent instructions from that warp. The number of execution units and the dual-issue rules differ between the two architectures.
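As a purely illustrative toy model of that per-cycle decision (the real hardware selection policy is not public, and this ignores the restriction that dual-issued instructions must target different execution-unit types):

[code]
#include <cstdio>
#include <vector>

// Toy model only: one warp scheduler, a few resident warps, and a
// per-cycle pick of one eligible warp to issue 1-2 instructions from.
struct Warp { int id; bool stalled; int independentInstrs; };

int main()
{
    std::vector<Warp> resident = { {0, false, 2}, {1, true, 1}, {2, false, 1} };

    for (int cycle = 0; cycle < 4; ++cycle) {
        bool issued = false;
        for (Warp &w : resident) {
            if (!w.stalled && w.independentInstrs > 0) {
                int n = (w.independentInstrs >= 2) ? 2 : 1;  // dual-issue if possible
                printf("cycle %d: issue %d instruction(s) from warp %d\n",
                       cycle, n, w.id);
                w.independentInstrs -= n;
                issued = true;
                break;  // one warp selected per scheduler per cycle
            }
        }
        if (!issued)
            printf("cycle %d: no eligible warp, scheduler idles\n", cycle);
    }
    return 0;
}
[/code]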

QUESTION 2

“CUDA cores” is a marketing term for the number of integer/single-precision floating-point execution units.

…they do independent 32-bit floating-point additions (each 1 clock cycle long)…

No instruction on the GPU takes 1 cycle. Do you mean the instructions have no data dependency?

If you launch only 1 thread per kernel, then the warp will have only 1 active thread, as the compute work distributor and the SM will not coalesce threads from different blocks (or kernels) into the same warp. When issuing to a set of execution units, only 1/32 of the units will be utilized.

GM107/GM108 can execute 32 kernels simultaneously. Each SM is limited to 16 kernels. GM107 has 5 SMs (full chip). Each warp scheduler will manage 1 or 2 warps. The warp scheduler will stall often, and when it does issue, it will issue 1 or 2 instructions for a warp at 1/32 utilization. The two instructions cannot both be integer or floating-point operations; the instruction mix has to use other execution units such as the special function unit, the double-precision floating-point unit, the texture unit, the shared memory unit, etc.
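You can check what your device reports for concurrent kernel execution at runtime (a minimal sketch, error checking omitted):

[code]
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0
    printf("%s: concurrentKernels = %d, multiProcessorCount = %d\n",
           prop.name, prop.concurrentKernels, prop.multiProcessorCount);
    return 0;
}
[/code]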

without needing interaction with the scheduler again between the first and the second FP operation. Is this correct?

The scheduler chooses new warps every cycle, so I’m unclear what you mean by “without interaction”. On cycle 1 it issues 2 instructions. On cycle 2 it picks a new warp (or the same warp) and issues 1-2 more instructions.

QUESTION 3

Is the scheduler capable of keeping the two kernels running simultaneously, as long as the cores required by both do not exceed the cores available on the SMM functional unit (= 32)? (This is similar to Question_2, actually.)

You have some incorrect assumptions regarding how superscalar (dual-issue) pipelined microprocessors work. You may want to read additional literature.

The warp scheduler can issue instructions from multiple kernels. The pipeline is longer than 1 cycle (tens of cycles). Warps from multiple kernels can be in the pipeline of an execution unit at the same time. If you launch blocks with a non-multiple of WARP_SIZE threads, the warps will be dispatched to the execution units with the extra lanes disabled, reducing the execution units’ utilization.
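As a concrete (hypothetical) example of that effect: a 48-thread block rounds up to 2 warps, the second one half-disabled, so the average lane utilization is 48/64 = 75%:

[code]
#include <cstdio>

int main()
{
    const int WARP_SIZE = 32;
    int threadsPerBlock = 48;  // hypothetical non-multiple of WARP_SIZE
    int warps = (threadsPerBlock + WARP_SIZE - 1) / WARP_SIZE;        // = 2
    float util = (float)threadsPerBlock / (float)(warps * WARP_SIZE); // = 0.75
    printf("warps per block = %d, lane utilization = %.2f\n", warps, util);
    return 0;
}
[/code]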

QUESTION 4

[i]Now suppose the two kernels of the previous question each have 1 warp of 32 threads, but 16 of those threads are conditionally disabled, so effectively only 16 need to compute, as in the previous example.

Question_4) Does the answer to Question_3 change now?[/i]

The scheduler dispatches all 32 lanes of the warp to the execution units with an active mask. Inactive threads still execute through the pipeline.
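If you want to observe the active mask yourself, here is an illustrative sketch using the pre-Volta __ballot() intrinsic (on sm_70 and later, __ballot_sync() replaces it):

[code]
#include <cstdio>
#include <cuda_runtime.h>

__global__ void showMask(unsigned int *out)
{
    if (threadIdx.x < 16) {
        // Only lanes 0-15 are active in this branch, yet all 32 lanes
        // of the warp still flow through the execution pipeline.
        unsigned int mask = __ballot(1);   // expect 0x0000ffff
        if (threadIdx.x == 0)
            *out = mask;
    }
}

int main()
{
    unsigned int *d_out, h_out = 0;
    cudaMalloc(&d_out, sizeof(unsigned int));
    showMask<<<1, 32>>>(d_out);   // 1 block, 1 full warp
    cudaMemcpy(&h_out, d_out, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    printf("active mask in branch: 0x%08x\n", h_out);
    cudaFree(d_out);
    return 0;
}
[/code]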

The majority of the information in this post is in the Nsight VSE CUDA Profiler documentation and is observable in Nsight VSE CUDA profiler reports.