That is mostly correct. I wouldn’t phrase it this way:
No, the scheduler doesn’t organize threads. When threads are deposited on an SM, they are already numbered, and they already belong to a specific warp. The warp scheduler determines which instruction each warp (or partial warp, if the source code calls for conditional behavior) will execute next.
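To illustrate that warp membership follows from thread numbering rather than from a scheduler decision, here is a minimal sketch. It assumes a one-dimensional thread block, where the CUDA programming guide documents that warps are formed from consecutive thread indices. The kernel and variable names are made up for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread records the warp and lane implied by its index.
// Assumes a 1D block, where warps are formed from consecutive thread indices.
__global__ void record_warp_membership(int *warp_of, int *lane_of)
{
    int tid = threadIdx.x;
    warp_of[tid] = tid / warpSize;  // which warp of the block this thread belongs to
    lane_of[tid] = tid % warpSize;  // position of the thread within that warp
}

int main()
{
    const int n = 64;  // one block of 64 threads
    int *warp_of = nullptr;
    int *lane_of = nullptr;
    if (cudaMallocManaged(&warp_of, n * sizeof(int)) != cudaSuccess ||
        cudaMallocManaged(&lane_of, n * sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    record_warp_membership<<<1, n>>>(warp_of, lane_of);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Print a few samples; the mapping depends only on the thread index.
    for (int i = 0; i < n; i += 16)
        printf("thread %2d -> warp %d, lane %2d\n", i, warp_of[i], lane_of[i]);

    cudaFree(warp_of);
    cudaFree(lane_of);
    return 0;
}
```

Nothing in the kernel asks a scheduler which warp a thread is in; the answer is a pure function of `threadIdx.x` and `warpSize`.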
The stack is also not something that is created on the fly. A stack frame may be created “on the fly” (i.e. as part of the function call procedure), but the stack is always present. Conceptually, the stack is identified by a pointer to a particular location in the local space of that thread.
I also find this wording confusing or incorrect:
When threads reach a (non-inlined) function call, they follow the function call procedure that is common to most processors I am familiar with. ← Please click that link and read it first. (I’m not suggesting it is a perfectly accurate description of GPU function call behavior, but the general concepts there are useful for background understanding.) They push the return address onto the stack and then jump to the first instruction of the function.
A stack frame may be created if the compiler determines it to be necessary. The stack frame may contain more or less arbitrary information, as needed by the function call. It doesn’t contain instructions. Instructions are retrieved from the thread instruction stream, just as they are for non-function-call instructions. The stack frame may store parameter/argument data needed by the function call. The stack frame may also store the states of various registers, so those registers can be “reused” by the thread while it processes the function body. The function may also retrieve arguments by referring to specific registers that hold pointers to those arguments. At the conclusion of the function body, those registers are “restored” from the stack frame (an area in the local space of the thread), before the return address that was placed on the stack is put back into the instruction pointer register for the thread.
I don’t consider the description I have given to be perfect (this sort of topic will inevitably attract word-smithing from those who are smarter than me, and a careful treatment would need an article of approximately the length of the one I linked), but I believe it gives a sufficient general understanding of what happens when a normal call is made to a non-inlined function.
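One publicly documented handle on the per-thread stack discussed above is the runtime's stack size limit. The sketch below only queries and adjusts that limit with `cudaDeviceGetLimit` and `cudaDeviceSetLimit`; the value 4096 is an arbitrary choice for illustration, and the runtime may adjust the value it actually applies.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t stack_bytes = 0;

    // Query the current per-thread stack size limit (in bytes).
    cudaError_t err = cudaDeviceGetLimit(&stack_bytes, cudaLimitStackSize);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceGetLimit failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("per-thread stack size limit: %zu bytes\n", stack_bytes);

    // Request a different limit, e.g. for deeply nested or recursive calls.
    // 4096 is an arbitrary example value, not a recommendation.
    err = cudaDeviceSetLimit(cudaLimitStackSize, 4096);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Read the limit back to see what the runtime actually applied.
    err = cudaDeviceGetLimit(&stack_bytes, cudaLimitStackSize);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceGetLimit failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("per-thread stack size limit after request: %zu bytes\n", stack_bytes);
    return 0;
}
```

The limit exists before any kernel runs, which is consistent with the stack being always present while individual stack frames come and go with function calls.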
Most of this is not publicly specified by NVIDIA, which means it could all change tomorrow. Nearly all of these concepts can be confirmed by careful study of the SASS code of various test cases, e.g. using the cuobjdump tool. That is where my statements come from, not from any specific documentation or specification provided by NVIDIA. Therefore, these statements shouldn’t be used as a specification or guarantee of behavior by the NVIDIA compiler. They are merely my imperfect understanding of what I have witnessed by doing this sort of study.
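A test case of the kind described above might look like the following sketch. The file name `call_test.cu` and the function names are assumptions made for this example. `__noinline__` is a hint asking the compiler not to inline the function, so a real call will usually, though not necessarily, show up in the SASS.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Build and inspect (the file name call_test.cu is an assumption for this example):
//   nvcc -o call_test call_test.cu
//   cuobjdump -sass call_test
// Adding "-Xptxas -v" to the nvcc line also reports per-function stack frame sizes.

// Hypothetical device function; __noinline__ asks the compiler to keep it as a
// separate function so that the call site is visible in the generated SASS.
__device__ __noinline__ int scale_and_add(int x, int y)
{
    return 3 * x + y;
}

__global__ void caller(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = scale_and_add(in[i], i);  // the function call to look for in SASS
}

int main()
{
    const int n = 8;
    int *in = nullptr;
    int *out = nullptr;
    if (cudaMallocManaged(&in, n * sizeof(int)) != cudaSuccess ||
        cudaMallocManaged(&out, n * sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    for (int i = 0; i < n; ++i)
        in[i] = i;

    caller<<<1, n>>>(in, out, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int i = 0; i < n; ++i)
        printf("out[%d] = %d\n", i, out[i]);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

What the disassembly shows around the call site (how the return address is handled, whether any registers are saved to local memory) depends on the compiler version and target architecture, so treat whatever you observe as an observation and not a guarantee.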