How are device functions executed by warps?

Ziqi · May 31, 2022, 2:05am

Suppose we have a kernel function defined as follows:

__global__
void foo()
{
    boo();
}

while boo is defined as

__device__
void boo()
{
    some statement 0;
    some statement 1;
}

When a warp starts from the foo kernel and executes to the boo(); statement, does the warp exhaust all boo before creating the next warp, or does the warp go inside the body of boo and finishes after executing each statement?

Robert_Crovella · May 31, 2022, 5:03am

each thread in every warp executes boo(). The fine-grained execution behavior is not specified. In some situations, the warp executes in lockstep, but that mental model will break as you consider more complicated codes.

The basic principle that should govern the CUDA programmer is that CUDA specifies no particular order of thread execution.

warps are not created, except at the point that the threadblock is deposited by the GPU block scheduler onto the SM. For any given threadblock, all warps are created at precisely the same instant.

Based on your cross posting, it seems you believe that device threads have no stack. That is incorrect. They (each) have a stack, and they can call functions using the usual method implemented by many stack based processors.

A fully inlined function does not need a call procedure, of course.

Ziqi · May 31, 2022, 1:08pm

Sorry for the confusion caused by my asking on stack overflow. I meant __inline__ device functions do not cause stack creation, as they are lines of codes. According to your explanation, CUDA warps for a block are created as it is scheduled onto an SM. The following statement is my understanding for thread execution: states of all threads of a scheduled block are somehow kept by the SM and the scheduler organizes threads that have the same instructions to execute into a group (when there is no branch deviation, the group includes all 32 threads corresponding to the warp) and schedule that group to its warp determined from the very start. When some threads execute to a non-inlined device function, the SM will record them entering a new stack frame, and update their execution to the first instruction in the device function. The instructions in the new stack frame from these threads will then be scheduled to be executed by warps as usual. So there is no warp creation, but only thread states updating and warp scheduling. Is this understanding correct? Thanks for replying!

Robert_Crovella · May 31, 2022, 3:02pm

That is mostly correct. I wouldn’t phrase it this way:

No, the scheduler doesn’t organize threads. When threads are deposited on a SM, they are already numbered, and they already belong to a specific warp. The warp scheduler determines what instruction each warp (or partial warp, if there is conditional behavior called for by the source code) will execute next.

The stack is also not something that is created on the fly. A stack frame may be created “on the fly” (i.e. as part of the function call procedure), but the stack is always present. The stack is conceptually a pointer to a particular location in the local space of that thread. I also find this wording to be confusing or not correct:

To recap:

When threads reach a (non-inlined) function call, they will follow the function call procedure that is common for most processors I am familiar with. ← Please click that link and read it first. (I’m not suggesting it is a perfectly accurate description of GPU function call behavior, but the general concepts there are useful for background understanding.) They will push the return address onto the stack, and then jump to the first instruction of the function.

A stack frame may be created if determined to be necessary by the compiler. The stack frame may contain more or less arbitrary information, as needed by the function call. It doesn’t contain instructions. Instructions are retrieved from the thread instruction stream, just as the processing of non-function-call instruction happens. The stack frame may store parameter/argument data needed by the function call. The stack frame may also store the states of various registers, so those registers can be “reused” by the the thread processing, as it is processing the function body. The function may also retrieve arguments by referring to specific registers that have pointers to those arguments. At the conclusion of the function body processing, those registers will be “restored” from the stack frame (an area in the local space of the thread), before the return address that was placed on the stack is put back into the instruction pointer register for the thread.

I don’t consider this description that I have given to be perfect (this sort of topic will inevitably attract word-smithing from those who are smarter than me, and requires a topic of approximately the length I linked to do a careful treatment) but I believe it gives a sufficient general understanding of what happens when a normal call is made to a non-inlined function.

Most of this is not publicly specified by NVIDIA, which means it could all change tomorrow. Nearly all of these concepts can be confirmed by careful study of SASS code of various test cases, eg. using the cuobjdump tool. That is where my statements come from, not from any specific documentation or specification provided by NVIDIA. Therefore, these statements shouldn’t be used as a specification or guarantee of behavior by the NVIDIA compiler. They are merely my imperfect understanding of what I have witnessed by doing this sort of study.