Minimizing stack size depending on user's application

Compiling OptiX device code at runtime always requires the OptiX SDK headers on the target system, no matter which compiler you use for that.

When using NVCC at runtime, you would also need a CUDA Toolkit installed on the target system, so your current approach is not only about that one SDK dependency.

NVRTC comes with the CUDA Toolkit in the form of an API header and some redistributable dynamic libraries you need to ship with your application. It only supports translating device code to PTX or OptiX-IR. It’s usually faster than NVCC because it compiles in memory and doesn’t generate temporary files on disk.
One of the redistributables contains the CUDA standard library functions, which might remove the need to have a CUDA Toolkit installed on the target system just for the CUDA headers. (It’s been quite some time since I used that.)
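For illustration, here is a minimal sketch of that NVRTC workflow, compiling a CUDA source string to PTX in memory. The function name, include paths, and options are placeholders, not a recommendation for your specific setup:

```cpp
#include <nvrtc.h>
#include <string>

// Hypothetical helper: compile OptiX device code (CUDA source string) to PTX.
std::string compileToPtx(const char* cudaSource)
{
    nvrtcProgram prog = nullptr;
    nvrtcCreateProgram(&prog, cudaSource, "device_programs.cu", 0, nullptr, nullptr);

    // Include paths to the OptiX SDK and CUDA headers your application ships.
    const char* options[] = {
        "-I/path/to/optix/include",      // placeholder
        "-I/path/to/cuda/include",       // placeholder
        "--gpu-architecture=compute_50", // example target
        "--relocatable-device-code=true"
    };
    const nvrtcResult res = nvrtcCompileProgram(prog, 4, options);

    // Always fetch the log; it contains the errors and warnings.
    size_t logSize = 0;
    nvrtcGetProgramLogSize(prog, &logSize);
    std::string log(logSize, '\0');
    nvrtcGetProgramLog(prog, &log[0]);

    std::string ptx;
    if (res == NVRTC_SUCCESS)
    {
        size_t ptxSize = 0;
        nvrtcGetPTXSize(prog, &ptxSize);
        ptx.resize(ptxSize);
        nvrtcGetPTX(prog, &ptx[0]);
    }
    nvrtcDestroyProgram(&prog);
    return ptx; // empty on failure; check the log then
}
```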

Now, if your ray tracing algorithm requires a lot of memory per ray, the question is whether there are different ways to manage that memory.
For example, you could allocate a big device buffer and implement a small allocator using a single atomic, which just hands out a properly aligned pointer to the required memory size per ray at the start of the ray generation program.
That would immediately remove all the “local depot” memory (check your PTX code) you currently need for your local stack definition.
A smaller local depot could also make Shader Execution Reordering (SER) on Ada more beneficial, in case you want to use that.
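Here is a minimal sketch of that allocator idea inside the ray generation program. All names (poolBase, poolNext, etc.) and the per-ray size and alignment are hypothetical examples:

```cpp
#include <optix.h>

struct LaunchParams
{
    unsigned char*      poolBase; // big pre-allocated device buffer
    unsigned long long  poolSize; // its size in bytes
    unsigned long long* poolNext; // "next free pointer" variable used by the atomic
    // ... your other launch parameters
};

extern "C" __constant__ LaunchParams params;

extern "C" __global__ void __raygen__rg()
{
    // Each ray grabs one properly aligned chunk up front.
    const unsigned long long bytesPerRay = 1024;                           // example size
    const unsigned long long alignedSize = (bytesPerRay + 15ull) & ~15ull; // 16-byte aligned

    const unsigned long long offset = atomicAdd(params.poolNext, alignedSize);
    if (offset + alignedSize > params.poolSize)
        return; // out of memory: this launch is incomplete (handled on the host, see below)

    unsigned char* rayStack = params.poolBase + offset;

    // ... use rayStack instead of a local stack array, then trace as usual.
}
```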

You would just need to make that allocation mechanism robust enough to handle out-of-memory conditions.
For example, when the required memory exceeds the pre-allocated device buffer size, you can detect that by looking at the allocator state after the launch, i.e. the “next free pointer” variable inside a device buffer which the atomic uses to grab chunks. If that value is higher than the pre-allocated size, the last launch was incomplete and needs to be redone with a bigger device buffer.
There would be a one-time contention on that atomic inside the ray generation program while each ray grabs its chunk, but after that the algorithm shouldn’t be affected by the different local depot sizes of the different runs you saw before.
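A minimal host-side sketch of that check-and-relaunch loop, reusing the hypothetical names from the allocator sketch above (updating the poolBase/poolSize fields inside the device-side launch parameters after a resize is omitted for brevity, as is error checking):

```cpp
#include <cuda_runtime.h>
#include <optix.h>
#include <optix_stubs.h>

// Hypothetical helper: relaunch with a growing pool until all rays got their chunk.
void launchWithPoolResize(OptixPipeline pipeline, CUstream stream,
                          CUdeviceptr d_params, const OptixShaderBindingTable& sbt,
                          unsigned int width, unsigned int height,
                          unsigned char*& d_poolBase, unsigned long long& poolSize,
                          unsigned long long* d_poolNext)
{
    while (true)
    {
        // Reset the "next free pointer" before each launch.
        const unsigned long long zero = 0;
        cudaMemcpy(d_poolNext, &zero, sizeof(zero), cudaMemcpyHostToDevice);

        optixLaunch(pipeline, stream, d_params, sizeof(LaunchParams), &sbt, width, height, 1);
        cudaStreamSynchronize(stream);

        unsigned long long used = 0;
        cudaMemcpy(&used, d_poolNext, sizeof(used), cudaMemcpyDeviceToHost);

        if (used <= poolSize)
            break; // every ray got its chunk; the launch was complete

        // Incomplete launch: grow the pool to what was actually requested
        // (optionally with some headroom) and redo the launch.
        cudaFree(d_poolBase);
        poolSize = used;
        cudaMalloc(reinterpret_cast<void**>(&d_poolBase), poolSize);
        // Update poolBase/poolSize inside the device-side LaunchParams here.
    }
}
```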

The above idea assumes your stack requirement targets an iterative algorithm, not a recursive one where more stack would be grabbed per recursion level inside closest-hit programs.
That “dynamic” memory allocation approach would not require any recompilation of the OptiX device code.