Minimizing stack size depending on user's application

Hi, I have developed a radiometrically accurate OptiX kernel that traces rays through unstructured, triangulated volumes containing multiple molecular species. The solution works well, but it can require considerable stack memory (approx. 1 KB per ray) depending on the number of wavelengths and molecular species used in the per-ray radiance transport integration equations.

My problem is that if the per-ray stack memory is dimensioned for the worst-case scenario, the solution runs much slower than if the stack is dimensioned to the actual number of wavelengths and molecules defined in the user's inputs.

My current solution is to dynamically build the kernel during program initialization by passing various "defines" to the nvcc compiler to dimension the stack. The drawback of this approach is that it requires the end user to become a licensed OptiX user and download its libraries (a code dependency that some users may not want to take on).

Alternatively, I could pre-build multiple kernels for different stack sizes and deliver them with my release packages. This is doable, but it leads to messy packaging and maintenance problems when users request configurations that haven't been pre-built.

Any advice would be greatly appreciated as I move my code from a research application to an enterprise one.

Thanks,

Dennis

Hi @crow,

Have you looked into using NVRTC? It would allow you to move some or all of your device compilation from build time to run time, so you could inject the right defines into the source at run time and recompile on the fly. The NVRTC library gets packaged with your application, which removes the need for users to install anything. The main issue this tends to raise is code security, since you will have to ship your kernel source with your application; some groups prefer to encrypt their kernel source so that it's not accessible to their users. With NVRTC you won't need any kernel code changes, you'll just need to sort out the dynamic define injection.
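Roughly, the runtime compilation could look like the sketch below. The define names (NUM_WAVELENGTHS, NUM_SPECIES) and include paths are placeholders for whatever your kernel actually expects:

#include <nvrtc.h>
#include <string>

std::string compileKernelToPTX( const char* source, int numWavelengths, int numSpecies )
{
    nvrtcProgram prog = nullptr;
    nvrtcCreateProgram( &prog, source, "radiance_kernel.cu", 0, nullptr, nullptr );

    // Inject the user's configuration as compile-time defines.
    const std::string wavDef     = "-DNUM_WAVELENGTHS=" + std::to_string( numWavelengths );
    const std::string speciesDef = "-DNUM_SPECIES="     + std::to_string( numSpecies );
    const char* options[] = {
        wavDef.c_str(),
        speciesDef.c_str(),
        "--gpu-architecture=compute_75",    // pick whatever matches your minimum target
        "--relocatable-device-code=true",
        "-I<path to OptiX SDK include>",    // placeholder
        "-I<path to CUDA headers>"          // placeholder
    };
    nvrtcCompileProgram( prog, 6, options );  // check the returned nvrtcResult and the program log in real code

    size_t ptxSize = 0;
    nvrtcGetPTXSize( prog, &ptxSize );
    std::string ptx( ptxSize, '\0' );
    nvrtcGetPTX( prog, &ptx[0] );
    nvrtcDestroyProgram( &prog );
    return ptx;  // hand this to OptiX module creation as before
}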

Another possibility is OptiX parameter specialization. This lets you declare a subset of your launch parameters as compile-time constants, which can be used to elide unused features or further optimize the code during OptiX pipeline creation. Specialized variables aren't the same as defines, but they behave very similarly in many cases. Parameter specialization happens during the PTX-to-SASS phase, so you can use this approach with either nvcc or NVRTC. With parameter specialization you'll need some minor code changes in the host-side pipeline setup to identify the launch params to be specialized and, if needed, to move any relevant variables into your launch params.
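As a rough sketch of the host-side change, assuming a hypothetical launch-parameter struct Params with a numWavelengths member (both names are just placeholders):

#include <optix.h>
#include <cstddef>

// Hypothetical launch-parameter struct shared between host and device code.
struct Params
{
    unsigned int numWavelengths;
    // ... other launch parameters
};

void setupSpecialization( OptixModuleCompileOptions& moduleCompileOptions, const unsigned int* boundNumWavelengths )
{
    // Tell OptiX that params.numWavelengths is a compile-time constant for this pipeline.
    static OptixModuleCompileBoundValueEntry boundValue = {};
    boundValue.pipelineParamOffsetInBytes = offsetof( Params, numWavelengths );
    boundValue.sizeInBytes                = sizeof( unsigned int );
    boundValue.boundValuePtr              = boundNumWavelengths;  // host pointer, must stay valid until module creation
    boundValue.annotation                 = "numWavelengths";

    moduleCompileOptions.boundValues    = &boundValue;
    moduleCompileOptions.numBoundValues = 1;
}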

Let us know if either of these approaches might do the trick, or whether we should brainstorm other ideas.


David.

Compiling OptiX device code at runtime always requires the OptiX SDK on the target system, no matter what compiler you use for that.

When using NVCC at runtime, you would also need a CUDA Toolkit installed on the target system, so this is not only about one SDK dependency in your current approach.

NVRTC comes with the CUDA Toolkit in the form of an API header and some dynamic redistributable libraries you need to ship with your application. It only supports device code translation to PTX or OptiX-IR, and it's usually faster than NVCC because it doesn't generate temporary files on disk.
One of the redistributables contains the CUDA standard library functions, which might remove the need to have a CUDA Toolkit installed on the target system just for the CUDA headers. (It's been quite some time since I used that.)

Now, if your ray tracing algorithm requires a lot of memory per ray, the question is whether there are different ways to manage that memory.
For example, you could allocate a big device buffer and implement a small allocator using a single atomic, which hands out a properly aligned pointer to the required amount of memory per ray at the start of the ray generation program.
That would immediately remove all "local depot" memory (read your PTX code) you currently need for your local stack definition.
A smaller local depot could also make Shader Execution Reordering (SER) on Ada beneficial, in case you want to use that.

You would just need to make that allocation mechanism robust enough to handle out-of-memory conditions.
For example, when the required memory exceeds the pre-allocated device buffer size, you would detect that by looking at the allocator state (the "next free pointer" variable inside a device buffer that the atomic uses to grab chunks): if that value is higher than the pre-allocated size, the last launch was incomplete and needs to be redone with a bigger device buffer.
There would be a one-time congestion of the atomic access inside the ray generation program to get the chunks per ray, but after that the algorithm shouldn't be affected by the different local depot sizes of the different runs you saw before.
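A minimal sketch of such a bump allocator, assuming hypothetical launch-parameter fields poolBase, poolCapacity and poolNext (a single 64-bit counter in device memory that the host resets to zero before each launch):

struct Params
{
    unsigned char*      poolBase;      // big pre-allocated device buffer (cudaMalloc pointers are 256-byte aligned)
    unsigned long long  poolCapacity;  // buffer size in bytes
    unsigned long long* poolNext;      // "next free pointer" counter, reset to 0 by the host before each launch
    // ... other launch parameters
};

extern "C" __constant__ Params params;

// Grab one properly aligned chunk per ray at the start of the ray generation program.
__device__ void* allocPerRay( unsigned long long sizeInBytes, unsigned long long alignment )
{
    // Round each request up to a multiple of the alignment so consecutive chunks stay aligned
    // (assuming all rays request the same power-of-two alignment).
    const unsigned long long padded = ( sizeInBytes + alignment - 1 ) & ~( alignment - 1 );
    const unsigned long long offset = atomicAdd( params.poolNext, padded );

    if( offset + padded > params.poolCapacity )
        return nullptr;  // incomplete launch: the host checks *poolNext afterwards and relaunches with a bigger buffer

    return params.poolBase + offset;
}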

The above idea assumes your stack requirement comes from an iterative algorithm, not a recursive one where more stack would be grabbed per recursion inside closest-hit programs.
That “dynamic” memory allocation approach would not require any recompilation of the OptiX device code.

Thanks for the good information.

I have been thinking about ways to reduce stack memory by using a global memory allocation scheme instead. If I understand you correctly, atomics could be used to effectively reuse this memory for multiple rays, since not all rays are being processed concurrently. If this is correct, are there any rules of thumb to estimate how many rays can be processed concurrently, so I can minimize the global memory buffer?

As for the type of ray tracing algorithm being used, it is an iterative algorithm, not a recursive one.

Thanks again, Dennis

The concurrency depends on your GPU and kernel, e.g., register usage, shared memory, etc. [1][2]. The statistic you want for determining the right size of your buffer is the "Theoretical Occupancy" [3]. Nsight Compute can tell you what it is for your OptiX or CUDA kernels, but I don't know if there's an API call or function that will do it, and it's slightly complicated for OptiX since you don't have control over the launch dimensions or shared memory size. I'll ask around and see if there are any concrete suggestions. In the meantime, you could try using a conservative size just to make sure this approach works.

[1] https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#hardware-multithreading

[2] https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications

[3] https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/achievedoccupancy.htm


David.

If this is correct, then are there any rules-of-thumb to estimate how many rays can be concurrently processed to minimize the global memory buffer?

I wouldn't make that so complicated. Keep it simple and first try whether this works:
If your ray generation program currently holds the 1 kB of stack space you need in local memory per ray, then simply allocate a buffer of size launch dimension (== number of threads) * 1 kB. Each launch index (ray path) then gets a nicely aligned, power-of-two block of 1024 bytes inside that buffer, and you can calculate its address without needing atomics at all. You'd never free it again inside the launch anyway!
Atomics would only come into play if the memory block sizes per ray could differ, but from your description it sounds as if that only changes between launches due to different scenes.

And if you need different memory amounts per scene, you can simply control that with a size value inside the launch parameters. Just make sure that the resulting addresses of your blocks are properly aligned for the data types you write into them; a float4, for example, needs 16-byte aligned addresses.
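In the ray generation program that boils down to a single address calculation. A rough sketch, assuming hypothetical launch-parameter fields perRayBuffer and bytesPerRay:

#include <optix.h>

struct Params
{
    unsigned char*     perRayBuffer;  // device buffer of size launchWidth * launchHeight * bytesPerRay
    unsigned long long bytesPerRay;   // per-scene block size, kept a multiple of 16 for float4 alignment
    // ... other launch parameters
};

extern "C" __constant__ Params params;

extern "C" __global__ void __raygen__radiance()
{
    const uint3 idx = optixGetLaunchIndex();
    const uint3 dim = optixGetLaunchDimensions();

    // One fixed-size block per launch index (== per ray path); no atomics needed.
    const unsigned long long linearIndex = (unsigned long long) idx.y * dim.x + idx.x;
    unsigned char* myBlock = params.perRayBuffer + linearIndex * params.bytesPerRay;

    // ... use myBlock as the per-ray scratch space while walking the ray with repeated optixTrace() calls
}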

So let's say you have a launch dimension of 1024 * 1024 and you need 1024 bytes per launch index: that would be exactly 1 GB. If that is too big for your system configuration (it shouldn't be, or the local memory version wouldn't have worked either), you can also partition the task into multiple optixLaunch calls with smaller launch dimensions.
That would effectively be the same memory OptiX used internally before that change.

Slight correction: OptiX and CUDA won't allocate more than the space needed for the maximum number of resident threads the hardware can support; they do not necessarily allocate enough local memory for all threads in a grid. Here's a comment illustrating exactly how the local memory buffer size is calculated in CUDA:

OptiX uses a little bit of extra stack space, meaning LOCAL_LMEM_PER_THREAD might be higher in OptiX than in CUDA, but otherwise the lmem buffer size is calculated in OptiX the same way as in CUDA. So you could do something similar for a global memory allocation using the tables I posted links to above. For example, an AD102 chip has 144 SMs and a maximum of 1,536 resident threads per SM, so on an RTX 6000 Ada you can safely allocate space for 221,184 threads. Your kernel might have a lower maximum number of threads per SM; you can use the occupancy calculator in Nsight Compute to check.
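If you want to derive that hardware upper bound at run time instead of hard-coding it per chip, something along these lines should work (the function name and bytesPerRay are just for illustration):

#include <cuda_runtime.h>
#include <cstddef>

// Upper bound on concurrently resident threads, used to size the per-ray scratch pool.
size_t queryPoolSizeUpperBound( int deviceOrdinal, size_t bytesPerRay )
{
    cudaDeviceProp prop = {};
    cudaGetDeviceProperties( &prop, deviceOrdinal );

    // e.g. AD102: 144 SMs * 1,536 threads/SM = 221,184 resident threads.
    const size_t maxResidentThreads = (size_t) prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;

    // Your kernel's actual occupancy may be lower, so this is a conservative (safe) pool size.
    return maxResidentThreads * bytesPerRay;
}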

How you might index into this buffer, and whether/how to use atomics to reserve and release your blocks, may be slightly trickier, but you may be able to use some combination of the SM id [1], your block id, and your thread id to find a unique slot in your global memory buffer. Or you can use a conservative amount of memory and atomics to treat it like a ring buffer, letting your warps reserve space in first-come, first-served order.

Do note that the approach Detlef suggested is much easier than what I'm talking about, as long as you have the memory for it. If you can allocate space for all threads in your grid, it is extremely simple. If the memory you need for the grid exceeds the available space or the amount of VRAM on your device, then you can dive into the size calculation and some kind of fancier indexing scheme. I will say that when I ask CUDA experts about this kind of thing, they tend to recommend the simple pool-allocator or ring-buffer approach with slightly more memory than needed over trying to compute the absolute smallest size and getting the indexing right.

[1] https://docs.nvidia.com/cuda/parallel-thread-execution/index.html?highlight=smid#special-registers-smid


David.

Thank you everyone for all the good information… Many things I need to think about further.

When I implement a simple approach using global memory, with each ray having its own properly aligned chunk of global memory, the solution runs about 1.5x slower than an equivalent implementation using stack memory.

The nature of the application is that OptiX is used to cast rays through a volumetric "polygon soup" of triangles, using repeated optixTrace() calls to walk the ray through the volume. An integration is performed at each step to determine the cumulative spectral radiance of the ray. Some pseudo-code is shown below:

For each triangle intersection                            // Dozens of intersections
    For each wavelength                                   // 8, 16, or 32 wavelengths
        RTData rtData = rtData_gmem[iray * nwav + iwav]   // struct size = 32, 48, or 64 bytes
        Integrate(rtData)
        rtData_gmem[iray * nwav + iwav] = rtData
    End wavelengths
End triangles

The slowdown occurs when rtData is written back into global memory (line 5 of the pseudo-code).

I like the idea of using global memory, since the internal per-ray arrays can then be dynamically allocated. So my thinking now is to develop a hybrid approach: a smaller amount of static stack memory coupled with some dynamic global memory to handle the complex scenarios a user might request (this may involve redundant calls to optixTrace()).

Any suggestions on this particular OptiX use case would be greatly appreciated.

Again, thank you for all the responses.

Dennis

Hey Dennis,

So that result is a little surprising, in the sense that there isn't a big hardware difference between global memory and local memory: stack (local memory) is just global memory with a special indexing mode and automated allocation management to make it easy to use. Another reason it's a little surprising is that memory writes are less likely to stall your code, unless there is a read-back dependency or the hardware memory system is saturated with transactions. Unlike reads/loads, which have to wait for the value to arrive, stores don't, so it's less common to see a stall due to a store.

I can imagine a couple possible reasons why it might be slower when using global memory. The first possibility is that when you use the stack, the compiler has identified a decent number of the most used values and allocated registers for them rather than storing the values in memory. Another possibility is that something else changed and we’re not quite comparing apples to apples anymore. Maybe there is a different coalescing pattern and the gmem accesses are queueing in some suboptimal ordering.

My first suggestion would be to dive into Nsight Compute and try to figure out why the new version experiences such a significant slowdown. In case you haven't used Nsight Compute before, take a look at the 'baseline' feature: you can profile your kernel once in each mode (lmem vs. gmem), open the two profiles, and use the baseline feature to compare them and quickly spot the most significant differences.

If you find that the number of memory transactions is the bottleneck, there may be opportunities to combine multiple writes into larger stores with fewer instructions and speed things up. Check how the rtData struct write is being translated into SASS and whether you could benefit from 64-bit or 128-bit I/O instructions.
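For example (purely illustrative, assuming your 32-byte struct variant can be laid out as two float4s and the global pointer is 16-byte aligned), the write-back could become two 128-bit stores:

// Hypothetical layout of the 32-byte variant of RTData, forced to 16-byte alignment.
struct __align__(16) RTData
{
    float4 a;   // e.g. four radiance channels
    float4 b;   // e.g. four more channels / path state
};

// Write-back as two 128-bit stores (ST.E.128 in SASS) instead of several narrower stores.
__device__ void storeRTData( RTData* rtData_gmem, unsigned int slot, const RTData& rtData )
{
    float4* dst = reinterpret_cast<float4*>( &rtData_gmem[slot] );
    dst[0] = rtData.a;
    dst[1] = rtData.b;
}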

Or maybe the write at the end of the loop is causing the read at the beginning of the next iteration to serialize and stall? You could run an experiment to see if alternating between two cache lines removes the stall. For example, a cache line is 128 bytes, so if your struct size is 32 bytes, maybe process the wavelengths in this order: 0, 4, 1, 5, 2, 6, 3, 7, 8, 12, 9, … That way each read doesn't have to wait for the immediately preceding write to finish. This might be a silly idea, and I'm speculating wildly, but see if you can pin down exactly where the stall is occurring and think about ways you might fix it.
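A rough sketch of that interleaved ordering (assuming the 32-byte struct, i.e. four structs per 128-byte cache line):

// Alternate between two adjacent cache lines so a load doesn't immediately wait on the previous store.
const int structsPerLine = static_cast<int>( 128 / sizeof( RTData ) );  // 4 for a 32-byte struct
for( int base = 0; base < nwav; base += 2 * structsPerLine )
    for( int i = 0; i < structsPerLine; ++i )
        for( int line = 0; line < 2; ++line )
        {
            const int iwav = base + line * structsPerLine + i;   // yields 0,4,1,5,2,6,3,7, 8,12,9,13,...
            if( iwav >= nwav )
                continue;
            // load, Integrate(), and store rtData_gmem[iray * nwav + iwav] here
        }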

https://docs.nvidia.com/nsight-compute/NsightCompute/index.html#baselines


David.