[Resolved] What is in the stack?


I’m trying to understand what is stored in the stack in OptiX.

As I understand it, we set the stack size per context, and one stack is attached to each thread in the ray generation program. When a ray is launched, the thread carries with it the stack, which stores the ray’s payload.

I thought that, in a recursive ray tracer for example, a stack overflow would occur because there were too many payloads to keep in memory. But right now I have a program with a radiance ray that has a payload of one float plus three uints, and a shadow ray with only a float, and there is only one bounce. However, my stack needs to be bigger than 1024 to avoid a stack overflow. Surely, this is way more than just my two payloads.

So I wonder, what else is in the stack?
(I mean in general, not in my particular case. What is stored in the stack besides the ray payloads, if they are stored there at all? For example, do we also store information about the hits? About the scene tree? Do we keep track of which program called the current ray?)

Thanks for your help!

The stack is also used to save and restore live variables around function calls (e.g. rtTrace or callable programs).
That’s the background for one piece of performance advice in the OptiX Programming Guide, which starts with “Try to minimize live state across calls to rtTrace in programs.”
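To illustrate what "live state" means here, a minimal sketch in plain C++ rather than real OptiX device code. The function `trace` is a hypothetical stand-in for rtTrace, and the names `shadeManyLive` and `shadeFewLive` are made up for this example. A value is live across a call when it is computed before the call and still needed after it, so it has to be saved somewhere (in OptiX, on the stack) while the call runs.

```cpp
// Hypothetical stand-in for rtTrace: returns a hit distance for a ray id.
static float trace(int rayId)
{
    return 1.0f + 0.5f * static_cast<float>(rayId);
}

// Version A: a, b, c and d are computed before the call and used after it.
// All four are live across the call and would have to be saved and restored.
float shadeManyLive(int rayId)
{
    const float a = 0.1f * static_cast<float>(rayId);
    const float b = 0.2f * static_cast<float>(rayId);
    const float c = 0.3f * static_cast<float>(rayId);
    const float d = 0.4f * static_cast<float>(rayId);
    const float t = trace(rayId);
    return t * (a + b + c + d);
}

// Version B: the four values are folded into one weight before the call.
// Only a single float is live across the call.
float shadeFewLive(int rayId)
{
    const float a = 0.1f * static_cast<float>(rayId);
    const float b = 0.2f * static_cast<float>(rayId);
    const float c = 0.3f * static_cast<float>(rayId);
    const float d = 0.4f * static_cast<float>(rayId);
    const float weight = a + b + c + d;  // a, b, c, d are dead after this line
    const float t = trace(rayId);
    return t * weight;
}
```

Both versions compute the same result. A real compiler may already perform this kind of folding on its own, so treat this only as a picture of the idea, not as a guaranteed saving.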

It makes more sense now, I see how it’s linked to the advice to use a minimal stack too.
Thank you!

I am also having this stack issue. In my application, I want a new ray to be cast every time the original ray hits the geometry, until the accumulated distance reaches 100 meters. For a point in a scene with a large open area, reaching the 100 meter criterion is not hard. However, a stack overflow happens in narrow corridors, because only a limited distance is accumulated per bounce. The maximum number of recursions I can achieve right now is 38, with a ray casting recursion scheme that I don't think can be reduced any further in size. I am currently using a GTX 1070 and am wondering whether this recursion limit can be increased by using a better GPU, like a GTX 1080 Ti. I sincerely appreciate any answers to this question. Thanks!

If you only continue a single new ray for every hit point, there is no need to do this recursively at all.
You can do that much more easily with an iterative path tracer if all you need is the accumulated distance along a path.

Please have a look at the OptiX Advanced Samples on github.com.
The new OptiX Introduction examples show how to implement a small and elegant iterative path tracer step by step.
Links here: https://devtalk.nvidia.com/default/topic/998546/optix/optix-advanced-samples-on-github/

For your use case, optixIntro_04 is already enough to accumulate the traveled distance along a path using a brute force path tracer.
The stack size requirement of that implementation is minimal. There is no recursive rtTrace call in that program.
You would just need to change the rtPayload structure to contain the distance and remove the fields you don’t need.
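As a rough picture of the iterative structure described above, here is a minimal sketch in plain C++, not the actual optixIntro_04 code. `PerRayData`, `traceSegment` and `accumulateDistance` are hypothetical names: `PerRayData` plays the role of the rtPayload structure, and `traceSegment` stands in for a single non-recursive rtTrace call whose closest hit program writes the segment length into the payload. The toy scene is a "corridor" in which every segment is 2.5 m long.

```cpp
// Hypothetical per-ray data, playing the role of the rtPayload structure.
struct PerRayData
{
    float distance;  // length of the segment that was just traced
    bool  hit;       // false when the ray missed all geometry
};

// Hypothetical stand-in for one rtTrace call. In this toy corridor every
// ray hits a wall after 2.5 m, regardless of the bounce index.
static void traceSegment(int /*bounce*/, PerRayData& prd)
{
    prd.distance = 2.5f;
    prd.hit = true;
}

// Iterative path: a single loop, as it would sit in the ray generation
// program. No trace call is ever made from inside another trace call.
float accumulateDistance(float targetDistance, int maxBounces, int* bouncesUsed)
{
    float total = 0.0f;
    int bounce = 0;
    while (bounce < maxBounces && total < targetDistance)
    {
        PerRayData prd = {0.0f, false};
        traceSegment(bounce, prd);
        if (!prd.hit)
        {
            break;  // the path left the scene
        }
        total += prd.distance;
        ++bounce;
    }
    if (bouncesUsed != nullptr)
    {
        *bouncesUsed = bounce;
    }
    return total;
}
```

Calling `accumulateDistance(100.0f, 1000, &n)` in this toy setup takes 40 bounces to reach 100 m. Because the loop depth never grows, the number of bounces is limited by `maxBounces` and not by the stack size.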

The accumulation done in that application might not even be required, depending on the reflection properties used in your algorithm.

Please work through the rest of the introduction tutorials as well for additional information.

BTW, this also answers your questions about the random number generation in the CUDA forum. All Monte Carlo sampling examples have an implementation of a simple random number generator. There is no need to seed a buffer with cuRAND.
Also if you need to have a relative time inside the device code you can use clock().
Search for the TIME_VIEW define inside the OptiX SDK example source code for examples generating a heat map view with that.
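For reference, a simple per-thread random number generator of the kind mentioned above can be as small as a linear congruential generator. The following is a sketch in plain C++ written in the spirit of the helpers in the SDK samples, not a copy of them. The function `makeSeed` and its mixing constants are hypothetical, chosen only so that each pixel and iteration starts from a different state.

```cpp
#include <cstdint>

// Advances the state with one 32-bit linear congruential step
// (multiplier 1664525, increment 1013904223) and returns the low 24 bits.
inline std::uint32_t lcg(std::uint32_t& state)
{
    state = 1664525u * state + 1013904223u;
    return state & 0x00FFFFFFu;
}

// Uniform float in [0, 1): a 24-bit integer divided by 2^24.
inline float rnd(std::uint32_t& state)
{
    return static_cast<float>(lcg(state)) / static_cast<float>(0x01000000);
}

// Hypothetical seed function: gives each pixel and iteration its own
// starting state. The constants are arbitrary odd numbers.
inline std::uint32_t makeSeed(std::uint32_t pixelIndex, std::uint32_t iteration)
{
    return pixelIndex * 9781u + iteration * 6271u + 1u;
}
```

The state is a single 32-bit integer that can live in the per-ray payload, so no buffer has to be filled on the host beforehand. The statistical quality of such a generator is modest, which is usually acceptable for simple Monte Carlo sampling experiments.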

Thank you for replying to me, Detlef. Your solution is great, and I am trying to follow your suggestion. However, I originally used cuRAND to generate random numbers. After I changed the structure to iterative, the cuRAND functions could no longer be used in the program. I am wondering why that happens and how to avoid it. Thanks.

I haven’t used cuRAND before.
What exactly wouldn’t work anymore when changing an algorithm from recursive to iterative on device side?
You would have the same number of launches and shoot the same number of rays.

Thank you, Detlef, for following up. I managed to run the iterative ray casting. Previously, in order to shoot the maximum number of recursive rays, I had set the stack size to a very large value. That large stack size was the exact reason why cuRAND was not functioning. Once I reduced the stack size to a small one (because the program is iterative now, rather than recursive), the program ran with much better performance, and I am now able to continue my task.

Although the problem is solved, I am still curious about how OptiX is implemented. RT_PROGRAM is a macro for __global__. I am wondering how OptiX manages these kernel functions. Does it combine them into one big kernel function using the PTX code, or does it launch several kernels at different times? And how is the stack structure implemented on the GPU? I appreciate any answers to these questions.

Here is a paper describing how OptiX worked in 2010.