How does the CUDA driver set the stack size on kernel invocation?

The CUDA Driver API documentation describes cuCtxSetLimit as follows:

https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g0651954dfb9788173e60a9af7201e65a

  • CU_LIMIT_STACK_SIZE controls the stack size in bytes of each GPU thread. Note that the CUDA driver will set the limit to the maximum of value and what the kernel function requires.

To me, this reads as if the CUDA driver automatically provides whatever stack size the kernel function requires, regardless of the configured per-thread stack size limit.

On the other hand, I got CUDA_ERROR_LAUNCH_FAILED due to insufficient stack space for a GPU kernel that allocates at least a char buf[2048] on its stack. After calling cuCtxSetLimit(CU_LIMIT_STACK_SIZE, 6000), the error went away.

I have the following two questions:

  • How should I interpret the description above? It looks to me as though the configured stack limit takes effect even when what the kernel function requires is larger than that limit.
  • How can I find out from the driver what the kernel function requires? I expected cuFuncGetAttribute() with CU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES to report it; however, it returns 0 for the kernel that places a 2048-byte buffer on its stack.

Best regards,