The calling process of a __device__ function

I am curious about the calling details of a __device__ function. As is well known, a function call on a CPU typically undergoes the following steps:

  1. The caller saves the context (registers). (This could also be done by the callee.)
  2. The caller pushes the parameters onto a stack.
  3. The callee executes the function.
  4. The caller restores the context.

This involves a stack. My question is: does CUDA maintain such a stack for each thread? And does CUDA save the context of each thread before calling a function? That could be quite space-consuming and slow (as the stack may reside in global or constant memory).

You seem to be describing an older, stack-based calling convention. Calling conventions for modern processor architectures (e.g. x86-64, AArch64) are typically mostly register-centric. While NVIDIA has (to the best of my knowledge) not publicly published an ABI for use on the GPU, it is clear from the support for various C++ features, the support for separate compilation and device-code linking, and inspection of the generated machine code that:

(1) An ABI must exist, is documented internally, and is used by toolchain components
(2) Calling conventions governing function calls are largely register-based

For special situations, such as passing many function arguments, or function arguments or return values that have aggregate types, register-centric calling conventions still require a stack, and this applies to the ABI used on GPUs as well.
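For illustration, here is a minimal sketch (my own example, not taken from any NVIDIA documentation) of the kind of aggregate argument that a register-centric convention cannot keep entirely in registers, so it would be expected to travel at least partly via the stack:

    // Sketch: an aggregate argument too large to pass entirely in registers.
    // Under a register-centric convention, such an argument would be
    // expected to be passed (at least in part) via the per-thread stack.
    struct Big {
        float data[32];
    };

    // __noinline__ keeps the call from being inlined away, so an actual
    // ABI call with argument passing has to be generated.
    __device__ __noinline__ float sum_big(Big b) {
        float s = 0.0f;
        for (int i = 0; i < 32; ++i)
            s += b.data[i];
        return s;
    }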

CUDA implements a per-thread stack. If you call cudaDeviceGetLimit() and inquire about cudaLimitStackSize, CUDA will report the currently configured size of this stack.
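A minimal host-side sketch (error checking omitted) that queries, and optionally raises, this limit through the documented cudaDeviceGetLimit() / cudaDeviceSetLimit() calls:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        size_t stackSize = 0;

        // Query the currently configured per-thread stack size.
        cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
        printf("per-thread stack size: %zu bytes\n", stackSize);

        // Deeply nested or recursive device code may need more; the 8 KiB
        // value here is purely illustrative.
        cudaDeviceSetLimit(cudaLimitStackSize, 8 * 1024);

        cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
        printf("per-thread stack size is now: %zu bytes\n", stackSize);
        return 0;
    }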

Function calls that use the internally specified ABI calling conventions are indeed fairly slow, which is why the CUDA toolchain relies heavily on function inlining as an optimization strategy. There is also support for link-time optimization that allows (among other things) function inlining across compilation units. I have never examined myself how well that works, but the fact that separate compilation with link-time optimization often results in performance close to that achieved by whole-program compilation would seem to indicate that it works quite well.
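As a rough illustration (my own sketch; the qualifiers and nvcc flags shown are the documented ones, but the performance effect will vary with the code and architecture), inlining can be steered per function, and device link-time optimization is enabled at compile and link time:

    // __noinline__ requests an actual ABI call; __forceinline__ (and small
    // functions in general under the default heuristics) get inlined, so no
    // call sequence appears in the generated machine code.
    __device__ __noinline__    float scale_called(float x)  { return 2.0f * x; }
    __device__ __forceinline__ float scale_inlined(float x) { return 2.0f * x; }

    __global__ void kernel(float *out, const float *in, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = scale_called(in[i]) + scale_inlined(in[i]);
    }

    // Separate compilation with device link-time optimization (illustrative):
    //   nvcc -rdc=true -dlto -c a.cu b.cu
    //   nvcc -dlto a.o b.o -o app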

Thanks. I have a few more questions. Though parameter passing may be register-centric, the function being called will modify registers as it runs. My questions are:

  1. Who is responsible for preserving the caller's register context?
  2. When the caller and callee are compiled separately, their register usage may differ. For example, if the caller uses 80 registers while the callee uses 100 registers, what happens when the function is called? The same situation arises when different callers call the same callee.

ABIs specify which registers are caller-saved and which ones are callee-saved. Since NVIDIA’s ABI for the GPU is not public (NVIDIA may share it under NDA with partners in addition to using it internally, but I do not know one way or the other), I do not have that information. You could reverse engineer most of it if you really needed it. What are you trying to accomplish that requires this information?
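If you did want to reverse engineer it, one starting point would be to disassemble the machine code around a non-inlined call (cuobjdump --dump-sass is the documented disassembly option; the tiny probe below is purely hypothetical):

    // Hypothetical probe: a non-inlined callee, so the SASS shows which
    // registers carry the arguments and the return value, and which
    // registers the caller saves and restores around the call.
    __device__ __noinline__ int callee(int a, int b) { return a * b + a; }

    __global__ void caller(int *out, int x) { *out = callee(x, 3); }

    // Illustrative commands:
    //   nvcc -arch=sm_80 -cubin abi_probe.cu -o abi_probe.cubin
    //   cuobjdump --dump-sass abi_probe.cubin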

I do not know how the device-side linker handles functions with different register requirements. I think I knew some basics at one time, but since I have not needed that information for 10+ years, I cannot recall anything about this topic. Why do you need to know?

I’m just curious how the function-call process happens, because the register allocation scheme and performance requirements on a GPU are quite different from those on a CPU. I am wondering whether calling a device function through the ABI might change register usage and thus affect occupancy and performance.
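For example, would the ptxas resource report (nvcc -Xptxas -v, or --resource-usage) show a difference in registers per thread between a kernel that calls a helper through the ABI and one that has it inlined? (The kernels below are just a made-up test case.)

    // Made-up test case: the same helper, once forced through an ABI call
    // and once inlined. Compiling with "nvcc -Xptxas -v" reports registers
    // per thread for each kernel, which would expose any register-pressure
    // difference caused by the ABI call.
    __device__ __noinline__    float helper_called(float x)  { return x * x + 1.0f; }
    __device__ __forceinline__ float helper_inlined(float x) { return x * x + 1.0f; }

    __global__ void kernel_called(float *out, const float *in)  { out[threadIdx.x] = helper_called(in[threadIdx.x]); }
    __global__ void kernel_inlined(float *out, const float *in) { out[threadIdx.x] = helper_inlined(in[threadIdx.x]); }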

Here is a similar question.