If you have a small, fixed amount of data to process, there shouldn’t be an issue with defining a fixed-size local array for it and doing your calculations on that.
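A minimal sketch of what that looks like inside an OptiX device program (the hit program name, payload layout, and the math are just placeholders, not from this thread):

```cpp
#include <optix.h>

extern "C" __global__ void __closesthit__example()
{
    // Small, compile-time-sized scratch array. The final assembler decides whether
    // this lives in registers or in local memory (i.e. OptiX stack space).
    const float coeffs[4] = { 1.0f, 0.5f, 0.25f, 0.125f };

    const float t = optixGetRayTmax(); // hit distance along the ray

    // Evaluate a small polynomial in t as a stand-in for "your calculations".
    float result = 0.0f;
    float power  = 1.0f;
    for (int i = 0; i < 4; ++i)
    {
        result += coeffs[i] * power;
        power  *= t;
    }

    optixSetPayload_0(__float_as_uint(result));
}
```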
It’s a matter of OptiX stack space. If that happens inside a recursive algorithm, it could result in quite a stack size increase. Always calculate the OptiX stack size explicitly in your application. (Search the forum for that.)
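For reference, a sketch of the explicit stack-size calculation using the helpers from optix_stack_size.h, assuming the program groups and pipeline already exist and the depth values match what your renderer actually uses. Note that the exact helper signatures differ slightly between SDK versions (newer releases also take the pipeline as an argument), so check the headers of your SDK:

```cpp
#include <optix.h>
#include <optix_stubs.h>
#include <optix_stack_size.h>
#include <vector>

void configureStackSize(OptixPipeline pipeline,
                        const std::vector<OptixProgramGroup>& programGroups,
                        unsigned int maxTraceDepth,            // optixTrace recursion depth
                        unsigned int maxCCDepth,               // continuation-callable depth
                        unsigned int maxDCDepth,               // direct-callable depth
                        unsigned int maxTraversableGraphDepth)
{
    // Accumulate the stack requirements of all program groups in the pipeline.
    OptixStackSizes stackSizes = {};
    for (OptixProgramGroup pg : programGroups)
        optixUtilAccumulateStackSizes(pg, &stackSizes);

    // Compute the pipeline stack sizes from those requirements and the maximum depths.
    unsigned int dcStackSizeFromTraversal = 0;
    unsigned int dcStackSizeFromState     = 0;
    unsigned int continuationStackSize    = 0;
    optixUtilComputeStackSizes(&stackSizes, maxTraceDepth, maxCCDepth, maxDCDepth,
                               &dcStackSizeFromTraversal,
                               &dcStackSizeFromState,
                               &continuationStackSize);

    optixPipelineSetStackSize(pipeline,
                              dcStackSizeFromTraversal,
                              dcStackSizeFromState,
                              continuationStackSize,
                              maxTraversableGraphDepth);
}
```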
I don’t think Tensor Core instructions are supported in OptiX device code today.
If you need the full CUDA functionality (shared memory, tensor operations, etc.) for the MLP operations, it might make sense to implement a wavefront renderer where ray generation and shading are handled inside native CUDA kernels (see the sketch after the links below).
Described here:
https://forums.developer.nvidia.com/t/deep-learning-and-optix/112722/2
https://forums.developer.nvidia.com/t/lack-of-support-for-threadfence-in-optix-ir/269353/4
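To make the structure concrete, here is a rough sketch of such a wavefront loop. The struct and kernel names (WavefrontRay, WavefrontHit, Params, shadeWithMLP) are illustrative only; the real API used is optixLaunch plus the standard CUDA kernel launch syntax, and the pipeline, SBT, and device-side copy of the launch parameters are assumed to be set up elsewhere:

```cpp
#include <optix.h>
#include <optix_stubs.h>   // plus optix_function_table_definition.h in exactly one translation unit
#include <cuda_runtime.h>

struct WavefrontRay { float3 origin, direction; };
struct WavefrontHit { float3 position, normal; int valid; };

struct Params          // mirrored by the OptiX device programs
{
    WavefrontRay* rays;  // input:  current wave of rays
    WavefrontHit* hits;  // output: hit data written by the hit programs
    unsigned int  count;
};

// Native CUDA kernel: free to use shared memory, warp intrinsics, Tensor Cores, etc.
__global__ void shadeWithMLP(const WavefrontHit* hits, WavefrontRay* nextRays, unsigned int count)
{
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count || !hits[i].valid)
        return;
    // ... run the MLP on hits[i] here, then emit the continuation ray ...
    nextRays[i].origin    = hits[i].position;
    nextRays[i].direction = hits[i].normal;   // placeholder bounce direction
}

void renderWave(OptixPipeline pipeline, const OptixShaderBindingTable& sbt, CUstream stream,
                CUdeviceptr d_params, const Params& params, int maxBounces)
{
    const unsigned int block = 256;
    const unsigned int grid  = (params.count + block - 1) / block;

    for (int bounce = 0; bounce < maxBounces; ++bounce)
    {
        // Trace the current wave: the raygen program only reads params.rays,
        // calls optixTrace, and the hit programs write params.hits.
        optixLaunch(pipeline, stream, d_params, sizeof(Params), &sbt,
                    params.count, 1, 1);

        // Shade in a plain CUDA kernel and write the next wave of rays.
        shadeWithMLP<<<grid, block, 0, stream>>>(params.hits, params.rays, params.count);
    }
    cudaStreamSynchronize(stream);
}
```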
You cannot control which variables are stored in registers. That’s decided by the final assembler generating the microcode.
You can change the number of registers your pipeline should use with the OptixModuleCompileOptions field maxRegisterCount.
You normally set that to OPTIX_COMPILE_DEFAULT_MAX_REGISTER_COUNT (== 0), which lets OptiX decide; that currently uses 128 registers, but you can try changing it. The maximum is 255 (one register is reserved).
Using more registers can reduce spilling, but it also lowers occupancy, i.e. how much work can run in parallel. Always benchmark that. YMMV.