How to understand and set the stack size?

Hi,

The OptiX Programming Guide suggests minimizing the stack size as much as possible. However, this becomes annoying while the application is under development, as each new change may require readjusting the stack size. I am wondering if there are ways to estimate the stack budget rather than setting a very tight value each time.

To do this, I need to better understand the stack size: how and where it is used. From the comments in the OptiX SDK 5.1.1, I only know that the stack size is in bytes and is a global value for an OptiX context.

  1. But what does this value refer to? Is it the number of bytes per CUDA block or per thread?
  2. If my OptiX kernel is register bound, could I set a large stack budget once?
  3. From https://devtalk.nvidia.com/default/topic/527768/optix/optix-stack-size/ I know that the stack in OptiX is local memory (LMEM) that is cached in L1. Does that mean the more stack I set for OptiX, the less L1 is left for other caching?

I tried increasing my stack size from 1000 to 3000, for example, and there was no visible performance difference in a few test scenes. So I guess there should be some way to estimate the budget.

BTW, in the OptiX 5.1.1 PathTracer example, changing the stack size from 1800 to 40000 on a GTX 1080 Ti makes no visible performance difference at 1080p, but changing it to 90000 makes the demo super slow. Does that mean something?

Thanks,

Please read these threads as well:
https://devtalk.nvidia.com/default/topic/1004649/?comment=5130084
https://devtalk.nvidia.com/default/topic/881722/?comment=5268870
https://devtalk.nvidia.com/default/topic/1030827/?comment=5245692
https://devtalk.nvidia.com/default/topic/1030457/?comment=5242085

Hi Detlef,

Thank you for collecting these threads together. Things are clearer to me now.

I have one more question from https://devtalk.nvidia.com/default/topic/1010533 :
You said the stack size is per GPU core. Did you mean per CUDA core?

On the GTX 1080 Ti there are 128 CUDA cores per SM. If we set a stack size of 1024, we use 128 * 2688 (see PS for why) = 344064 bytes (about 344 KB) per SM. Considering we have only 96K of shared memory and 48K of total L1 cache, this stack size is already big. Could you please confirm?

Thanks

PS: Testing on OptiX 5.1, GTX 1080 Ti, Driver 416.94, Windows 10, the stack size is no longer multiplied by 5; the factor varies:

setStackSize -> getStackSize
Set 100 -> Get 384
Set 1000 -> Get 2640
Set 1024 -> Get 2688
Set 2000 -> Get 5136
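
The ratio between the requested and the reported size can be tabulated directly from the numbers above. A minimal sketch in Python (chosen only for illustration; it uses nothing but the four measured pairs, the helper name is made up, and it makes no claim about how OptiX computes the value internally):

```python
# (value passed to setStackSize, value returned by getStackSize),
# copied from the measurements listed above.
measurements = [(100, 384), (1000, 2640), (1024, 2688), (2000, 5136)]

def scale_factor(requested, reported):
    """Ratio of the reported stack size to the requested one."""
    return reported / requested

factors = [scale_factor(s, g) for s, g in measurements]

for (s, g), f in zip(measurements, factors):
    print(f"set {s:>5} -> get {g:>5}  (factor {f:.3f})")
```

For these four samples the factor falls from 3.84 to about 2.57 as the requested size grows, so it is clearly not a constant multiplier.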

Hi Detlef,

We can get “Local memory for all threads” by using setUsageReportCallback.
From my calculation, the stack size (OptiX 5.1) returned by getStackSize appears to be close to the stack size per thread.

Given a big stack size of 10240, getStackSize returns 25728.

Local memory for all threads (starting from Index 1) is 621.7 MB, let's say 651899700 bytes.
My graphics card has 28 SMs, so each SM uses 23282132 bytes.

23282132 / 25728 = 904 threads.

According to NVIDIA Nsight, I have a block size of 512 threads and a maximum of 2 active blocks per SM, so the calculated number is close to the maximum number of threads per SM (1024).

Tests with other stack sizes give similar results.

getStackSize 2688 -> Local memory 73819750 bytes -> 980 threads.
getStackSize 5248 -> Local memory 138097459 bytes -> 939 threads.
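
The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in this post: the function name is made up for illustration, and the inputs are the values from the usage report, from getStackSize, and the SM count of this card (28). It assumes local memory is split evenly across SMs.

```python
def resident_threads_per_sm(local_mem_bytes, stack_bytes, sm_count=28):
    """Estimate threads per SM, assuming local memory is split evenly
    across SMs and each thread reserves `stack_bytes` of it."""
    per_sm_bytes = local_mem_bytes // sm_count
    return per_sm_bytes // stack_bytes

# (getStackSize result, "Local memory for all threads" in bytes)
samples = [(25728, 651899700), (2688, 73819750), (5248, 138097459)]

estimates = [resident_threads_per_sm(mem, stack) for stack, mem in samples]
print(estimates)

# Upper bound reported by Nsight: 512 threads per block, 2 active blocks per SM.
max_threads_per_sm = 512 * 2
```

All three estimates stay below the 1024-thread bound, which is consistent with the reported size being a per-thread value.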

What is almost certain is that the value returned by getStackSize relates to threads rather than CUDA cores. Therefore, it uses more local memory than I expected.

Please let me know if anything is wrong.

Thanks

You’re right, it’s per thread, not per HW core, which makes minimizing this size even more important.

The thing is that you cannot control what happens inside OptiX, because everything about blocks and threads is abstracted away to allow the single-ray programming model.

You can only influence the required stack space by how you implement your algorithms. That means, if stack space is an issue, prefer iterative over recursive algorithms and try to minimize the number of live variables around rtTrace and callable program calls.
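
As a language-neutral illustration of the recursive versus iterative advice, here is a toy sketch in Python (not OptiX device code). The per-bounce `emission` and `reflectance` lists are hypothetical stand-ins for what a closest-hit program would return. The recursive form keeps one frame alive per bounce, while the iterative form carries only a running throughput and a sum.

```python
def radiance_recursive(emission, reflectance, depth=0):
    """One call frame per bounce: state stays live across the nested call."""
    if depth == len(emission):
        return 0.0
    incoming = radiance_recursive(emission, reflectance, depth + 1)
    return emission[depth] + reflectance[depth] * incoming

def radiance_iterative(emission, reflectance):
    """Single loop: only `throughput` and `total` survive between bounces."""
    total, throughput = 0.0, 1.0
    for e, r in zip(emission, reflectance):
        total += throughput * e
        throughput *= r
    return total

# Hypothetical three-bounce path.
emission = [0.0, 0.5, 2.0]
reflectance = [0.8, 0.5, 0.0]

print(radiance_recursive(emission, reflectance))
print(radiance_iterative(emission, reflectance))
```

Both functions compute the same sum. In an OptiX path tracer the same restructuring would typically mean calling rtTrace in a loop from the ray generation program and passing the next ray through the payload, rather than calling rtTrace again from the closest-hit program.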

The next major OptiX version will contain a new API to set the stack size in number of maximum recursions instead of bytes.