MLP Evaluation in Closest Hit

What is the best way to perform MLP evaluation within a closest hit shader? Would you just have a large for loop that handles the matrix multiplications? I would only want to use a shallow and narrow MLP, around 32 wide, I imagine. I also can’t imagine warp intrinsics are useful here, considering that I am already parallelizing across different threads using OptiX.

Thanks,
Alex

Also, I vaguely remember some restriction about constant-size arrays and registers. If I were to evaluate the MLP within each thread using a for loop, I would need to store the hidden layer values in registers somehow for maximum performance. How would I go about this?

If you have a small, fixed amount of data to process, there shouldn’t be an issue with defining a fixed-size local array for it and doing your calculations on that.
It’s a matter of OptiX stack space: if that happens inside a recursive algorithm, it can increase the stack size considerably. Always calculate the OptiX stack size explicitly in your application. (Search the forum for that.)
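
As a concrete example of that, here is a minimal sketch of a 32-wide, one-hidden-layer MLP evaluated with plain for loops and fixed-size local arrays inside a closest-hit program. The Params layout, the names evalMlp and __closesthit__radiance, and the use of barycentrics as input features are assumptions for illustration, not something from this thread:

```cpp
#include <optix.h>

#define MLP_WIDTH 32

struct Params
{
    // Hypothetical: weights and biases uploaded once from the host, row-major.
    const float* w0;  // [MLP_WIDTH * MLP_WIDTH]
    const float* b0;  // [MLP_WIDTH]
    const float* w1;  // [MLP_WIDTH * MLP_WIDTH]
    const float* b1;  // [MLP_WIDTH]
};

extern "C" __constant__ Params params;

// One hidden layer, ReLU activation, all intermediates in fixed-size local arrays.
// The compiler keeps these in registers when it can; whatever spills goes to local
// memory, which counts toward the stack size you set on the pipeline.
static __forceinline__ __device__ void evalMlp(const float in[MLP_WIDTH], float out[MLP_WIDTH])
{
    float hidden[MLP_WIDTH];

    for (int i = 0; i < MLP_WIDTH; ++i)
    {
        float acc = params.b0[i];
        for (int j = 0; j < MLP_WIDTH; ++j)
            acc += params.w0[i * MLP_WIDTH + j] * in[j];
        hidden[i] = fmaxf(acc, 0.0f);  // ReLU
    }

    for (int i = 0; i < MLP_WIDTH; ++i)
    {
        float acc = params.b1[i];
        for (int j = 0; j < MLP_WIDTH; ++j)
            acc += params.w1[i * MLP_WIDTH + j] * hidden[j];
        out[i] = acc;
    }
}

extern "C" __global__ void __closesthit__radiance()
{
    float input[MLP_WIDTH] = {};  // fill with your encoded hit features in practice
    const float2 bary = optixGetTriangleBarycentrics();
    input[0] = bary.x;
    input[1] = bary.y;

    float output[MLP_WIDTH];
    evalMlp(input, output);

    // ... write the result to the per-ray payload here ...
}
```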

I don’t think tensor core instructions are supported in OptiX device code today.

If you need the full CUDA functionality (shared memory, tensor operations, etc.) for the MLP evaluation, it might make sense to implement a wavefront renderer where ray generation and shading are handled inside native CUDA kernels.

Described here:
https://forums.developer.nvidia.com/t/deep-learning-and-optix/112722/2
https://forums.developer.nvidia.com/t/lack-of-support-for-threadfence-in-optix-ir/269353/4
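
For illustration, here is a rough host-side sketch of that wavefront structure under assumed names (LaunchParams, launchShadeWithMlp, the hit-record and radiance buffers are all hypothetical): the OptiX launch only traces rays and stores hit data, and a native CUDA kernel compiled in its own translation unit then does the MLP shading with full CUDA functionality available.

```cpp
#include <optix.h>
#include <cuda.h>

// Hypothetical wrapper around a __global__ kernel, implemented in a separate .cu file:
//   shadeWithMlp<<<grid, block, 0, stream>>>(d_hits, d_radiance, count);
void launchShadeWithMlp(CUdeviceptr d_hits, CUdeviceptr d_radiance, int count, CUstream stream);

struct LaunchParams { /* ray-generation inputs, pointer to the hit-record buffer, ... */ };

void renderFrame(OptixPipeline pipeline, CUstream stream, CUdeviceptr d_params,
                 const OptixShaderBindingTable& sbt, unsigned int width, unsigned int height,
                 CUdeviceptr d_hits, CUdeviceptr d_radiance)
{
    // 1. Trace: the OptiX raygen/closest-hit programs only write hit data, no heavy shading.
    optixLaunch(pipeline, stream, d_params, sizeof(LaunchParams), &sbt, width, height, 1);

    // 2. Shade: evaluate the MLP for every hit record in a plain CUDA kernel,
    //    where shared memory and tensor instructions are available.
    launchShadeWithMlp(d_hits, d_radiance, static_cast<int>(width * height), stream);
}
```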

You cannot control which variables are stored in registers. That’s decided by the final assembler when it generates the microcode.
You can change the number of registers your pipeline is allowed to use with the OptixModuleCompileOptions field maxRegisterCount.
You normally set that to OPTIX_COMPILE_DEFAULT_MAX_REGISTER_COUNT (== 0), which lets OptiX decide; that currently means 128 registers, but you can try changing it. The maximum is 255 (one register is reserved).
Using more registers can reduce spilling, but it also affects how much work can run in parallel. Always benchmark that. YMMV.
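
For reference, a minimal sketch of setting that field; the value 168 is only an example to benchmark against, not a recommendation:

```cpp
OptixModuleCompileOptions moduleOptions = {};

// 0 == OPTIX_COMPILE_DEFAULT_MAX_REGISTER_COUNT lets OptiX decide (128 today).
moduleOptions.maxRegisterCount = 168;  // example value; benchmark your own
moduleOptions.optLevel         = OPTIX_COMPILE_OPTIMIZATION_LEVEL_3;
moduleOptions.debugLevel       = OPTIX_COMPILE_DEBUG_LEVEL_NONE;

// Pass moduleOptions to optixModuleCreate() (optixModuleCreateFromPTX() on older
// SDKs) when creating the module that contains your closest-hit program.
```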
