Working around the OptiX stack limitation with callable programs?

The application I’ve been building with OptiX runs into the stack limitation. It’s essentially a standard ray tracer, but with many more wavelengths – hundreds. Since the ray payload size cannot be allocated dynamically, I’ve had to set it to some maximum allowable number of wavelengths and tune that until it fits within the stack, which limits my application to a smaller set of wavelengths (about 64). To simulate more wavelengths, I then have to run the ray tracer multiple times, duplicating the tracing work and slowing everything down by a factor of ceil(number_of_wavelengths/64).
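To make the problem concrete, here’s roughly what my current setup looks like (all names are mine, just for illustration):

```cpp
#include <optixu/optixpp_namespace.h>

// Device side: OptiX per-ray payloads are fixed-size at compile time, so
// the wavelength count has to be capped to keep the stack small.
#define MAX_WAVELENGTHS 64

struct PerRayData
{
    float intensity[MAX_WAVELENGTHS]; // one slot per wavelength in this pass
    int   depth;
};

// Host side: covering N wavelengths then takes ceil(N / MAX_WAVELENGTHS)
// full launches, re-tracing the same geometry each time.
void traceAllWavelengths(optix::Context context, unsigned int totalWavelengths,
                         RTsize width, RTsize height)
{
    for (unsigned int offset = 0; offset < totalWavelengths; offset += MAX_WAVELENGTHS)
    {
        context["wavelength_offset"]->setUint(offset);
        context->launch(0, width, height);
    }
}
```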

I’ve been wondering whether the callable-program support added in OptiX 3.0 can help. The idea is to dynamically allocate memory on the GPU, have OptiX trace the rays, and then hand the per-wavelength processing off to CUDA, which could access that large memory store directly. If this works, I could keep a small stack, handle more wavelengths, and allocate exactly the amount of memory I need.

The callable-program examples, however, only discuss using rtDeclareVariable to share variables between OptiX and CUDA, and they don’t explain how to do what I’m suggesting. Is this even possible?

Thanks for the help!

In order to have separate CUDA kernels process the output of an OptiX run, you can just use the new rtBufferGetDevicePointer API to get a raw device pointer for an OptiX buffer, and then pass that pointer to your CUDA kernel.
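Something along these lines – the buffer name and kernel are placeholders, and it’s worth double-checking the wrapper signatures against the 3.0 documentation:

```cpp
#include <optixu/optixpp_namespace.h>

// A plain CUDA kernel that post-processes the OptiX output (body elided).
__global__ void processWavelengths(float* hitData, unsigned int numRays);

// 'context' is an already-configured optix::Context.
void runPostProcess(optix::Context context, unsigned int numRays)
{
    context->launch(0, numRays);

    // Raw CUDA device pointer behind the OptiX output buffer...
    optix::Buffer hitBuffer = context["hit_buffer"]->getBuffer();
    void* d_hits = hitBuffer->getDevicePointer(0 /* OptiX device ordinal */);

    // ...handed straight to CUDA, with no copy through the host.
    processWavelengths<<<(numRays + 255) / 256, 256>>>(
        static_cast<float*>(d_hits), numRays);
}
```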

Alternatively, you can allocate the buffer memory yourself before you launch OptiX and hand OptiX the device pointer. I believe the documentation has examples of doing both.
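Roughly like this – again the names are placeholders, and the exact CUDA-interop buffer calls (rtBufferCreateForCUDA / rtBufferSetDevicePointer and their C++ wrappers) should be checked against the programming guide for your OptiX version:

```cpp
#include <optixu/optixpp_namespace.h>
#include <cuda_runtime.h>

// Allocate the memory yourself, then wrap it in an OptiX buffer.
void attachPreallocatedBuffer(optix::Context context, unsigned int numRays)
{
    float* d_hits = 0;
    cudaMalloc(&d_hits, numRays * sizeof(float));

    optix::Buffer hitBuffer =
        context->createBufferForCUDA(RT_BUFFER_OUTPUT, RT_FORMAT_FLOAT, numRays);
    hitBuffer->setDevicePointer(0 /* OptiX device ordinal */, d_hits);
    context["hit_buffer"]->set(hitBuffer);
}
```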

I’m not sure which aspect of callable programs you’re referring to regarding sharing variables between OptiX and CUDA, but it sounds a lot like what you actually want is the new CUDA interop functionality: do multi-pass rendering and have CUDA do the work of combining the results of your passes.

Thanks for the response, Greg.

After pondering your response and looking in more detail at some of the updated and new samples in the 3.0 release, I think what I want to do is:

  • From my host code, initialize CUDA
  • In the CUDA code, dynamically set up a device buffer to hold information on the cast rays
  • Call the OptiX camera program with a pointer to this device buffer. The OptiX code then calls rtTrace and writes the intersection information (angles, material, etc.) into the device buffer
  • Have CUDA read this device buffer and perform the wavelength-dependent operations over the hundreds of wavelengths

By doing this, I can avoid passing large structures with many wavelengths through the OptiX stack and take advantage of CUDA’s more flexible memory handling.
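A rough sketch of what I have in mind – the record layout and kernel are hypothetical:

```cpp
#include <cuda_runtime.h>

// Hypothetical per-hit record written by the OptiX closest-hit program into
// the shared device buffer: wavelength-independent geometry only.
struct HitRecord
{
    float3 normal;      // shading normal at the hit point
    float  cosTheta;    // incident-angle term
    int    materialId;  // index into wavelength-dependent material tables
};

// One thread per (ray, wavelength) pair: read the compact hit records and
// do the spectral work outside the OptiX stack.
__global__ void shadeSpectral(const HitRecord* hits, unsigned int numRays,
                              const float* materialTable, // [material][wavelength]
                              unsigned int numWavelengths,
                              float* radianceOut)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= numRays * numWavelengths) return;

    unsigned int ray = idx / numWavelengths;
    unsigned int wl  = idx % numWavelengths;

    const HitRecord& h = hits[ray];
    radianceOut[idx] = h.cosTheta * materialTable[h.materialId * numWavelengths + wl];
}
```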

I am not sure I have conveyed my problem perfectly in the original post (10 minutes and a whiteboard would do it), but let me know if what I’ve said above sounds wrong.

Thanks again for the help.

Yep. The only downside is that you cannot store a pointer back into an intersection context, so any and all information you’ll need to defer the shading/wavelength computation must be stored along with your intersection info – interpolated attributes, texture samples, context-local attributes, etc. Just something to keep in mind: if that adds up to more than the size of the wavelength info, it may not be a win…
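For example, the deferred record might have to carry something like this (purely illustrative):

```cpp
#include <cuda_runtime.h>

// Everything a deferred shading pass would need, since no pointer back into
// the intersection context can be kept. If this record grows larger than the
// per-wavelength payload it replaces, deferring stops being a win.
struct DeferredHitRecord
{
    float3 shadingNormal;  // interpolated attribute
    float2 texCoord;       // or pre-sampled texture values...
    float4 albedoSample;   // ...like a texture sample taken at hit time
    float  hitT;           // context-local data needed later
    int    materialId;
};
```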