GPU max trace depth and stack size metrics

Hello,

We’re working with optical engineering software for non-sequential ray tracing simulations. It offers the possibility to run the simulation on a GPU and uses the OptiX library. One of the configurable settings is the number of splits to consider in a trace, which translates to the trace depth in OptiX.

What we experience is that for a single-precision simulation the maximum trace depth is 10, even for the simplest geometry, on an RTX A4500. The application hits a limit when it creates the ray tracing pipeline with a larger value.

After skimming through the OptiX documentation, my understanding is that this is related to the total stack size that the pipeline would require for the trace. Since this seems to depend only on the programs of which the pipeline is composed and on the trace depth, but NOT on the size of the scene graph (BVH), it would explain the observed behavior. When performing a double-precision trace, the limit is 5, i.e. half of the single-precision limit, which I would also expect according to my understanding.

Although the stack size required by a pipeline for a particular call graph is hard to predict because it depends on so many parameters, there must be some limited resource on the GPU which determines when the requirement is too high.

I would like to understand which resource this is (memory, L2 cache, L1 cache, …?) to enable an educated decision about which GPU to choose to extend the trace depth limit in a particular situation. The capability of the RTX A4500 is unfortunately too low in our particular case.

I fear that this won’t change much with different GPUs.
There is a hard limit of 64kB per active thread for the OptiX stack size, due to internal addressing methods.

For recursive algorithms, that can quickly become a bottleneck if the user’s device code uses a lot of program-local data and/or live variables across continuation calls (optixTrace and continuation callables).
How much VRAM is needed overall for all stack space inside an OptiX kernel on a specific GPU at runtime depends on the number of CUDA cores and on the actual device code.

If doubling the data size (going from float to double data types) already halves the number of splits the application can handle in its algorithm, then it’s probably already using quite a lot of program-local data inside a recursive algorithm. (BTW, the maximum recursion depth limit in OptiX is 31.)

Without knowing how the application implements its algorithm, it’s hard to say if that can be improved.
It’s recommended that OptiX applications always calculate the minimum required pipeline stack size. The OptiX SDK provides some helper functions for that inside a header.
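In case it helps, a minimal host-side sketch of that calculation could look roughly like the following (this is not the application’s code; the pipeline, program groups and depth values are assumed to exist in the surrounding host code, and the helper signatures differ slightly between SDK versions, e.g. OptiX 7.7 added a pipeline argument to optixUtilAccumulateStackSizes):

```cpp
#include <vector>

#include <optix.h>
#include <optix_stack_size.h>  // OptiX SDK helper header with the stack size utilities

// Minimal sketch: compute and set the minimum pipeline stack size for a given
// maximum trace depth (the "split level").
void configurePipelineStackSize( OptixPipeline                         pipeline,
                                 const std::vector<OptixProgramGroup>& programGroups,
                                 unsigned int                          maxTraceDepth )
{
    OptixStackSizes stackSizes = {};
    for( OptixProgramGroup pg : programGroups )
        optixUtilAccumulateStackSizes( pg, &stackSizes );

    unsigned int dcStackSizeFromTraversal = 0;
    unsigned int dcStackSizeFromState     = 0;
    unsigned int continuationStackSize    = 0;

    // The continuation stack grows roughly linearly with maxTraceDepth, which is
    // why deeper recursion (more splits) eventually runs into the per-thread limit.
    optixUtilComputeStackSizes( &stackSizes,
                                maxTraceDepth,
                                0,  // maxCCDepth: assuming no continuation callables
                                0,  // maxDCDepth: assuming no direct callables
                                &dcStackSizeFromTraversal,
                                &dcStackSizeFromState,
                                &continuationStackSize );

    optixPipelineSetStackSize( pipeline,
                               dcStackSizeFromTraversal,
                               dcStackSizeFromState,
                               continuationStackSize,
                               2 );  // maxTraversableGraphDepth, e.g. one IAS over GASes
}
```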
Converting a recursive algorithm to an iterative algorithm usually reduces the required stack space because the maximum recursion depth could become 1 or 2, but that would require drastic changes inside the software’s device programs which only the software vendor could do.


Thank you for this information.
If I understand you correctly, that means that the stack size you set via ‘optixPipelineSetStackSize’ can never exceed 64kB?

If that’s the case, I’m surprised that this is not mentioned anywhere in the OptiX programming guide, because it seems to me that in almost all cases this limit will hit you much earlier than the maximum recursion limit of 31.

What I’m also wondering is the following.
The programming guide says:

“A pipeline may be reused for different call graphs as long as the set of programs is the same. For this reason, the pipeline stack size is configured separately from the pipeline compilation options.”

As far as I understand it, this means that one and the same RT pipeline can consume more or less stack space depending on the trace depth, but also depending on which of the component programs are used in a particular trace. So if you have a pipeline set up to trace reflections and scattering, but only activate the reflection tracing program components, the pipeline would require less stack space than if both reflections and scattering were active. Is that understanding correct?

Well, optixPipelineSetStackSize doesn’t take a single size as an argument but calculates the overall stack size from the different direct, continuation and depth arguments, plus anything OptiX needs internally.
I don’t recall if the OptiX validation mode provides information about that when enabling the log callback at the highest level, 4.

The chapter 6.8 about the stack size that you cite explains some cases where the optixPipelineSetStackSize size arguments can be tuned to the minimum required size by a developer with knowledge of the effective program call graph. It’s a little involved.
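For what it’s worth, the generic upper bound for the continuation stack derived in that chapter boils down to roughly the following (a sketch from memory, not a verbatim copy of the guide; cssRG, cssCH, cssMS, cssAH, cssIS and cssCC are the per-program continuation stack sizes of the ray generation, closest-hit, miss, any-hit, intersection and continuation callable programs, and the exact expression may differ between SDK versions):

```cpp
#include <algorithm>

// Rough sketch of the generic continuation stack size formula; direct callable
// stack sizes are handled separately and omitted here.
unsigned int estimateContinuationStackSize( unsigned int cssRG, unsigned int cssCH,
                                            unsigned int cssMS, unsigned int cssAH,
                                            unsigned int cssIS, unsigned int cssCC,
                                            unsigned int maxTraceDepth, unsigned int maxCCDepth )
{
    const unsigned int cssCCTree = maxCCDepth * cssCC;                   // continuation callable call tree
    const unsigned int cssCHOrMS = std::max( cssCH, cssMS ) + cssCCTree; // worst case per trace level

    return cssRG + cssCCTree
           + ( std::max( maxTraceDepth, 1u ) - 1 ) * cssCHOrMS           // one term per additional trace level
           + std::min( maxTraceDepth, 1u ) * std::max( cssCHOrMS, cssIS + cssAH );
}
```

The linear term in maxTraceDepth is what makes the reserved stack scale with the split level, and a developer who knows the effective call graph (e.g. that certain programs are never reached at deeper levels) can shave terms off.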

The optixLaunch call associates pipelines and shader binding tables.
If you want a pipeline to have the minimum amount of resources and stack space for different program call graphs, it might make sense to actually implement different specialized pipelines.
OptiX provides a mechanism for that via parameter specialization.
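As a rough illustration of how that looks (the LaunchParams struct and the enableScatter flag are made up for this sketch; the bound-value fields come from the module compile options available in recent OptiX 7 SDKs):

```cpp
#include <cstddef>   // offsetof
#include <optix.h>

// Hypothetical launch parameter block; only the feature flag matters here.
struct LaunchParams
{
    int enableScatter;   // flag read by the device programs
    // ... other members ...
};

// Sketch: fill the module compile options so that params.enableScatter is known
// to be 0 at compile time. OptiX can then fold away the scatter code paths (and
// the stack they would otherwise contribute) for this specialized pipeline variant.
void specializeForNoScatter( OptixModuleCompileOptions& moduleCompileOptions )
{
    static const int scatterDisabled = 0;

    static OptixModuleCompileBoundValueEntry boundValue = {};
    boundValue.pipelineParamOffsetInBytes = offsetof( LaunchParams, enableScatter );
    boundValue.sizeInBytes                = sizeof( int );
    boundValue.boundValuePtr              = &scatterDisabled;
    boundValue.annotation                 = "enableScatter";  // shows up in compile feedback

    moduleCompileOptions.boundValues    = &boundValue;
    moduleCompileOptions.numBoundValues = 1;
    // ... then create the module and pipeline variant with these compile options ...
}
```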


Yes, the actual stack usage depends on which functions you call. The stack size that you set via the API reserves the maximum safe amount you might need in order not to run into stack corruption, but it does not mean you will use all the space reserved.

The snippet you mentioned about reusing pipelines is saying that you can set the max stack size you might need, and then toggle features on and off and keep using the same pipeline without redoing your stack settings. This assumes you’re okay with the total stack memory consumption, and don’t need to reclaim the memory when you toggle various features.

As for the 64kB stack size limit, in case it wasn’t clear, this is a per-thread limit. Usually, for good latency hiding, you’ll want to be able to handle several threads per CUDA core. So as an example, using 64kB of stack on a 5090 with only 3 threads per core would consume 4GB for stack alone. That comes out of your global memory budget, so it’s good to limit usage when possible.
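To make that arithmetic explicit (the CUDA core count below is the published RTX 5090 spec to the best of my knowledge; the threads-per-core figure is just the example number above):

```cpp
#include <cstdio>

int main()
{
    const double stackPerThreadBytes = 64.0 * 1024.0;  // per-thread stack limit
    const int    cudaCores           = 21760;          // GeForce RTX 5090 (published spec)
    const int    threadsPerCore      = 3;              // example occupancy from the post

    const double totalGB = stackPerThreadBytes * cudaCores * threadsPerCore
                           / ( 1024.0 * 1024.0 * 1024.0 );
    std::printf( "Total stack reservation: %.1f GB\n", totalGB );  // ~4 GB
    return 0;
}
```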

Yes if you have a large stack, that will in turn limit your maximum recursion depth to less than 31. We recommend using iterative ‘path tracing’ style algorithms when possible, rather than recursion, specifically to avoid high stack usage. Is that possible with your project? Do you have a branching factor greater than 1 with your splits? If each split means ending 1 ray and starting 1 new ray, then you can probably switch to iterative quite easily, and save a lot of stack space. Commercial renderers based on OptiX tend not to employ very much recursion and so do not bump into the per-thread stack limit, even though they are tracing deep paths with sometimes even more than 31 ray segments.
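To illustrate the difference in device code terms, here is a very reduced sketch, not the application’s actual programs (Params, the payload handling and the SBT indices are made up): the iterative style keeps one loop in the ray generation program and only ever traces one level deep, whereas a recursive style calls optixTrace again from the closest-hit program, so every split level keeps live state on the continuation stack.

```cpp
#include <optix.h>
#include <cuda_runtime.h>  // float3, make_float3

// Hypothetical launch parameters for this sketch.
struct Params
{
    OptixTraversableHandle handle;
};
extern "C" __constant__ Params params;

// Iterative style: one loop in raygen, trace depth stays at 1 regardless of how
// many path segments are traced. The closest-hit program only writes the next
// segment's origin/direction (or a terminate flag) into the payload registers.
extern "C" __global__ void __raygen__iterative()
{
    float3 origin    = make_float3( 0.0f, 0.0f, 0.0f );  // primary ray, made up
    float3 direction = make_float3( 0.0f, 0.0f, 1.0f );

    for( unsigned int segment = 0; segment < 64; ++segment )  // more than 31 segments is fine here
    {
        unsigned int done = 0;  // payload register 0: terminate flag set by CH/MS
        optixTrace( params.handle, origin, direction,
                    0.0f, 1e16f, 0.0f,           // tmin, tmax, ray time
                    OptixVisibilityMask( 255 ),
                    OPTIX_RAY_FLAG_NONE,
                    0, 1, 0,                      // SBT offset, SBT stride, miss index
                    done );                       // payload p0
        if( done )
            break;
        // ... read the next origin/direction from additional payload registers ...
    }
}

// A recursive style would instead call optixTrace again from inside the
// __closesthit__ program, once per split, which keeps a frame of live state per
// level and makes the reserved stack grow with the configured maximum trace depth.
```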

David.


Thank you, David, for the explanation.

The problem is that the optical engineering software that I mentioned in the opening post is a commercial product and therefore we don’t have insight into the code they’re using.
It is an application for simulating complex optical systems, including the mechanics, to calculate important physical properties like the irradiance on different areas of a detector surface. Therefore the ray tracing pipeline will eventually have to deal with all possible interactions of a ray with a surface, i.e. reflection, transmission, absorption, reflective scatter and transmissive scatter, in a physically correct way.

In the past the application did all of this on the CPU, but in the meantime it also offers the possibility to perform the ray tracing on NVIDIA GPUs using the OptiX library. It offers two different modes for the ray tracing. One is Monte Carlo, which ends one ray when starting the next; with this mode there is no problem with the number of splits, so according to your explanation I would assume that they are using an iterative algorithm.

However, the Monte Carlo method is not applicable in all cases, and for some simulations we require the branching factor to be greater than one. As far as I understand, this requires the recursive algorithm. You can configure the number of splits on each surface and the recursion depth (split level) for a simulation.

What we experience is that even with the simplest geometrical setup of a point light source which illuminates a transmissive scattering plane and a detector surface behind this plane, the maximum split level we can set on an RTX A4500 is 5 for double precision and 10 for single precision. This surprised us because with this setup the largest split level that the ray tracing has to deal with is in fact just 1, the transmissive scattering of the first plane. The detector plane is fully absorbing.

This was the point where I started to dive into the OptiX documentation and found the stack size setting. My theory is that the application calculates the stack size it would need for the whole, fully featured ray tracing pipeline and tries to set the stack size based on this for the trace run with the selected trace depth (split level), regardless of whether it will actually be needed in the trace or not.

The error that the application returns when exceeding the max split level is:

failed: MPCTasks:1386 OPTIX_ERROR_INVALID_VALUE

I would assume that this is most probably an application internal error code, right?

Based on what you explained, it looks like my theory would explain the observed behavior, unless I misunderstood something.

What I still don’t understand is: which physical property of a GPU determines the stack size limit? I would like to understand whether using a different GPU than our current RTX A4500 would improve the maximum split level. Or is it the 64kB per-thread limit, which, as far as I understood, would impose the same limit on all NVIDIA GPUs?

Your theory sounds both plausible and likely. I can’t say for sure without talking to the developer, but you’re probably right about the pipeline. For reducing stack size or exploring other options, it might be worth contacting the developer and having a conversation, especially if you’re paying for this software. They may be able and willing to make simple changes that could reduce stack sizes considerably, at least for the cases you mentioned where it appears the reserved stack space is not actually being used.

Yes, the invalid value error is an internal application error code. The output you’re seeing is most likely a debugging feature intended mainly for the developers to use.

The stack size limit is currently the same for all NVIDIA GPUs, and this applies to ray tracing APIs (DirectX, Vulkan, OptiX) but not to CUDA. It could theoretically change in the future, but we haven’t had much demand for that yet and it would take quite a bit of time and effort before it could happen (and the developer would need to be involved to make upgrades on their side as well). Keep in mind that there’s kind of a narrow space for having more than 64kB of stack per thread, since that will start at more than 1GB of stack. Thinking about it on a log scale, we don’t have to increase it much before you’re severely limiting your application’s global memory and/or running out of VRAM entirely. We are also considering an API that would let developers potentially move stack usage to global memory, which might be a solution to your problem.

So just out of curiosity, is the Monte Carlo solution too noisy for you, is that why it’s not always applicable? Or does the application not handle all cases? Monte Carlo is exactly what I was thinking when mentioning iterative algorithms. In theory, Monte Carlo can handle dealing with all possible ray events (reflection, absorption, transmission, etc.) and can always substitute for branching factors higher than 1, but of course whether that’s achievable in your case depends on both the application and on your goals and tolerances. Just wondering out loud if there is any potential to go this direction since in theory it solves all the stack size problems at the cost of added noise and the time needed to resolve the noise.

I do recommend communicating your findings and concerns to the developer, if you haven’t already. They can in turn reach out to us if they need engineering questions answered about stack usage, but I would speculate they have ways they could help you at their disposal that the OptiX team doesn’t have.

I hope that helps. Thanks for the explanation and good luck!

David.


“So just out of curiosity, is the Monte Carlo solution too noisy for you, is that why it’s not always applicable? Or does the application not handle all cases?”

There are two issues with Monte Carlo.

One is the time required to generate the source rays. For some reason, which we don’t understand, you can only generate the source rays for some source types on the GPU; for most of the source types we use, they are generated on the CPU. The generation of the source rays on the CPU takes a substantial amount of time, which is something we also do not understand, at least not why it takes so long. However, with Monte Carlo you have to generate orders of magnitude more source rays than with the branching method. This can eat up most of the benefit you get from running the ray tracing on the GPU.

The other reason is that with the branching method you can define, per surface, the number of child rays that should be generated. After investigating a first draft simulation run, you can often identify spots which you would like to investigate in more detail, i.e. sample with a much higher number of rays. You can also identify the next surface structure from which these spots originate. So with the branching method you give those surfaces a higher number of child rays to be generated when they are hit. Thus you can keep the total number of rays that need to be traced at a manageable level while reducing the noise in the particular areas that you’re interested in. With Monte Carlo you cannot do this.
