Hello forum,
I have not fully understood some aspects of CUDA scheduling.
Suppose we are on a Maxwell GPU such as the GeForce GTX 750 (I would also be interested in a reply covering Kepler).
Firstly, according to this image:
http://cdn2.ubergizmo.com/wp-content/uploads/2014/02/nvidia-kepler-vs-maxwell-sm.gif
and this post by an NVIDIA chief technologist:
https://devblogs.nvidia.com/parallelforall/5-things-you-should-know-about-new-maxwell-gpu-architecture/
“Each warp scheduler still has the flexibility to dual-issue”
the Maxwell schedulers should be able to dual-issue.
However, according to this:
Programming Guide :: CUDA Toolkit Documentation
“Then, at every instruction issue time, each scheduler issues one instruction for one of its assigned warps that is ready to execute, if any.”
the schedulers appear to do single-issue only (note that the corresponding text for compute capability 3.x does say it dual-issues there).
Question_1) So this is already a problem: which one is right, single-issue or dual-issue?
Now some more questions…
Suppose I launch a large number of kernels, each one having only 1 thread (I really mean 1 thread, not just 1 warp), and suppose they do independent 32-bit floating-point additions (each one 1 clock cycle long).
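To make the scenario concrete, here is a minimal sketch of the launch pattern I mean (the kernel name addOne, NUM_KERNELS, and the use of one stream per launch are just my illustrative choices, not anything prescribed): many kernels launched concurrently, each with a single thread, each doing an independent FP32 addition.

```
#include <cuda_runtime.h>

// One thread, one independent FP32 addition.
__global__ void addOne(float *x)
{
    *x = *x + 1.0f;
}

int main()
{
    const int NUM_KERNELS = 64;                        // a "large number" of kernels
    float *d_data;
    cudaMalloc(&d_data, NUM_KERNELS * sizeof(float));
    cudaMemset(d_data, 0, NUM_KERNELS * sizeof(float));

    // One stream per kernel so the launches can overlap on the device.
    cudaStream_t streams[NUM_KERNELS];
    for (int i = 0; i < NUM_KERNELS; ++i) {
        cudaStreamCreate(&streams[i]);
        addOne<<<1, 1, 0, streams[i]>>>(d_data + i);   // really 1 thread, not just 1 warp
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < NUM_KERNELS; ++i)
        cudaStreamDestroy(streams[i]);
    cudaFree(d_data);
    return 0;
}
```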
Note that the bottleneck in this case would be the scheduler, so:
Question_2) how many cores would be occupied computing, in each SMM functional unit?
I am guessing 2 cores per SMM functional unit, because:
one new thread would start execution every clock cycle in each SMM functional unit, because there is only one scheduler and it issues two instructions to the same warp, which has only 1 thread by my hypothesis (is the second instruction “remembered”? Or is it executed immediately, simultaneously with the first, by splitting the thread in two? I will suppose it is “remembered” and executed after the first).
So such a thread would execute two FP operations (dual-issue, correct?) in sequence, without needing to interact with the scheduler again between the first and the second FP operation. Is this correct?
So at every clock cycle a new thread enters execution and stays in execution for 2 clock cycles, so on average there would be 2 cores executing, and the answer to Question_2 would be 2… but this assumes the scheduler is capable of having two kernels running simultaneously, as long as core resources are sufficient (see also the next question). Is it capable of this?
For the next question, suppose one SMM functional unit has two kernels running on it, each one having only 1 warp with just 16 threads, and all of them need to do independent 32-bit FP operations (each one 1 clock cycle long). Let’s suppose the scheduler really can dual-issue:
Question_3) Is the scheduler capable of keeping the two kernels running simultaneously as long as the cores required by both do not exceed the cores available on the SMM functional unit (= 32)? (This is actually similar to Question_2.)
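To make this setup concrete too, here is a minimal sketch under my assumptions (kernelA, kernelB, and the stream/data names are just placeholders): two kernels, each launched as one block of 16 threads, in two different streams so they could in principle be resident on the same SMM at the same time.

```
#include <cuda_runtime.h>

// One block of 16 threads: a single warp with only half of its lanes used.
__global__ void kernelA(float *x)
{
    int i = threadIdx.x;
    x[i] = x[i] + 1.0f;          // independent FP32 addition
}

__global__ void kernelB(float *x)
{
    int i = threadIdx.x;
    x[i] = x[i] + 2.0f;          // independent FP32 addition
}

int main()
{
    const int N = 16;
    float *dA, *dB;
    cudaMalloc(&dA, N * sizeof(float));
    cudaMalloc(&dB, N * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Two different streams, so the two kernels may run concurrently.
    kernelA<<<1, N, 0, s1>>>(dA);
    kernelB<<<1, N, 0, s2>>>(dB);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(dA);
    cudaFree(dB);
    return 0;
}
```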
Now suppose the two kernels from the previous question each have 1 warp of 32 threads, but 16 of those threads are conditionally disabled, so effectively only 16 need to compute, as in the previous example.
Question_4) Does the answer to Question_3 change now?
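For Question_4 only the kernel itself would change; something like this sketch (maskedAdd is my placeholder name), launched the same way as in the previous sketch but with 32 threads per block:

```
// One full warp of 32 threads, of which 16 are conditionally disabled.
// Launched e.g. as maskedAdd<<<1, 32, 0, stream>>>(x);
__global__ void maskedAdd(float *x)
{
    int i = threadIdx.x;         // 0..31, one full warp
    if (i < 16) {                // the other 16 lanes are disabled by the condition
        x[i] = x[i] + 1.0f;      // only 16 threads actually compute
    }
}
```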
Thanks for any enlightenment.
Ugo