OptiX Neural Primitives Rendering with Cooperative Vectors and SER

Hi there, I am interested in efficient OptiX API usage for neural rendering in the context of neural primitives.

What would be the most efficient way to structure a scene of many (probably hundreds or even thousands of) tiny MLPs, and evaluate them in a volumetric rendering context via SER and Cooperative Vectors?

The programming guide provides intuitions for a global NeRF-like implementation, but what would be memory- and speed-efficient in a neural primitive context?

Thanks for your time.

Hi @Perukas, this is a good question.

Do you expect to be able to achieve some reasonable warp coherence in your MLP weights, for example using SER, or do you expect to have so many MLPs that each thread on average won’t be able to share weights even if you sort?


David.

Hi @dhart , thanks for the response.

The idea is to have neural primitives, so one tiny MLP per primitive. The primitives themselves could be bounded by AABBs or spheres, and at each primitive hit its MLP would be evaluated for some local sample points along the ray.

Regarding SER, from what I understood, threads that hit the same primitive could share MLP weights (but maybe I am wrong). Ideally I would achieve reasonable warp coherence for the MLP weights, of course, but I am not sure yet; I just want to try it to assess feasibility.

The thing I am stuck on is PyTorch/OptiX interop: specifically, how to structure weight and bias buffers in PyTorch for a smooth conversion via the OptiX host API and for efficient access in device code for N MLPs.

Perukas.

One way to handle this would be via the CuPy API. It allows for easy Python host-side generation of device arrays of arbitrary type, and it has a wrapper around the CUDA runtime API which you could use for lower-level array management. You might also take a look at the optixpy project, which wraps the OptiX C API in Python. It includes a few simple example codes that use CuPy to interface between Python and CUDA-C OptiX programs.

I’m wondering, if you have many tiny MLPs and need efficiency, have you considered tiny-cuda-nn instead of pytorch and interop?

Hi @Keith_Morley,

Thanks for the reply, I have already implemented python bindings for OptiX in my library. This is not the issue. What I was wondering about is the following:

The programming guide section on training with Cooperative Vectors mentions OPTIX_COOP_VEC_MATRIX_LAYOUT_TRAINING_OPTIMAL for training. However, to use it, one must call the host API to convert standard weight buffers to this opaque swizzled format. So in every optimization iteration one would convert standard fp16 buffers to TRAINING_OPTIMAL, perform MLP evaluation on device, and then convert back to propagate gradients (if I have understood the API usage correctly).

My question is: is this efficient? Maybe zero-copy access to row-major buffers, pointing directly at PyTorch device data, would be faster.

Additionally, the usage of the API for my use case is not straightforward. The programming guide gives an example for a single neural net; it is not clear whether I have to repeat all of this for hundreds of MLPs.

Thanks again for your time.

Perukas

Hi @lspano,

Thanks for the recommendation. I have been using tiny-cuda-nn (the Python bindings) for some time now. I haven't really checked, though, whether it can interop with OptiX programs; I think it is not straightforward to "glue" tiny-cuda-nn networks into OptiX.

I thought that since OptiX 9.1 provides cooperative vector features, it would be possible to implement my use case.

Thanks,

Antonis

You can use OPTIX_COOP_VEC_MATRIX_LAYOUT_TRAINING_OPTIMAL for training, i.e., for both the forward and backward passes. The author of the coop vec SDK sample pointed to this code snippet as an example:

extern "C" __global__ void __raygen__render_image()
{
    ...
    // Now compute the forward pass
    VecTIn* x_r_base = reinterpret_cast<VecTIn*>( params.base ) + p_idx;
    VecTIn x_r       = optixCoopVecLoad<VecTIn>( x_r_base );
    half* w      = params.weights.d_weights;
    half* b      = params.weights.d_biases;

    using namespace tin;
    using LinearTrainer16x32 =
        LinearTrainer<Act::RELU, Act::NONE, half, AccType,
                      OPTIX_COOP_VEC_MATRIX_LAYOUT_TRAINING_OPTIMAL,
                      INPUT_SIZE, HIDDEN_SIZE, HIDDEN_SIZE, OUTPUT_SIZE>;
    LinearTrainer16x32 trainer( w, b, nullptr, nullptr );
    VecTOut out = trainer.forward( x_r );
    writePixel( out, idx, output_w, /* offset */ dim.x * PIX_MULT );
}

Also, the conversion code can handle multiple matrices at the same time: optixCoopVecMatrixConvert takes a numNetworks argument, which can be the number of MLPs you are converting, as long as they all share the same inputNetworkDescription.
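To make that concrete, a batched convert over N identical networks implies laying both the source and destination buffers out with a uniform per-network stride. A small sizing sketch in plain C++ (the function and parameter names are made up; the "optimal" per-network byte size is opaque and must be queried from the host API at runtime, so here it is just an input):

```cpp
#include <cassert>
#include <cstddef>

// Buffer sizing for converting N identical MLPs in one batched call.
struct ConvertSizing {
    size_t inputStride;   // bytes between consecutive networks, source buffer
    size_t outputStride;  // bytes between consecutive networks, dest buffer
    size_t inputTotal;    // total source buffer size
    size_t outputTotal;   // total destination buffer size
};

ConvertSizing sizeBatchConvert(size_t rowMajorBytesPerNet,
                               size_t optimalBytesPerNet,
                               size_t numNetworks, size_t align) {
    auto alignUp = [](size_t v, size_t a) { return (v + a - 1) & ~(a - 1); };
    ConvertSizing s;
    s.inputStride  = alignUp(rowMajorBytesPerNet, align);
    s.outputStride = alignUp(optimalBytesPerNet, align);
    s.inputTotal   = s.inputStride * numNetworks;
    s.outputTotal  = s.outputStride * numNetworks;
    return s;
}
```

With this layout, one conversion call can sweep all N networks, and device code addresses network i at `base + i * stride` on either side of the conversion.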

Thank you for the snippet. I will try something similar and get back with my findings.

So it turns out that using cooperative vectors is trickier than I thought, mainly because of the (justifiable) alignment requirements.

However, I developed a toy example implementing my own linear layer forward and backward passes, and it runs pretty decently. Probably not at the frame rate I imagined on a 4090 (but maybe on a 5090 someone can achieve better performance).
I am using an anyhit-only volumetric tracer, so correct me if I'm wrong, but SER does not work for anyhit-only pipelines, right? You need to go through closesthit to achieve ray coherence.

Thanks again,

I will post more details as development progresses.


Very cool to hear it’s working!

You can use SER with a pipeline that only has anyhit shaders, but it might be hard to figure out how to make SER help. You can only reorder from a raygen program, so you can reorder before you cast the rays, and you have to figure out how sorting rays up front will make all your subsequent anyhit invocations more coherent. What is your main divergence problem: is it branching (like neighboring threads hitting different kinds of geometry and materials), or is it more about data divergence? If you have any special or non-standard volume data structures, it might be possible to sort before trace to help with coherent data access.
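As a host-side analogy of what a reorder key would accomplish here (this is not OptiX code; the hit primitive ID stands in for the MLP weight set a thread will fetch): group ray indices so that consecutive lanes shade with the same weights.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Given, for each ray, the ID of the primitive (hence the MLP weight set)
// it will shade with, produce a processing order in which rays sharing a
// weight set are adjacent. A stable sort keeps the original order within
// each group, which tends to preserve screen-space locality too.
std::vector<uint32_t> coherentRayOrder(const std::vector<uint32_t>& hitPrimId) {
    std::vector<uint32_t> order(hitPrimId.size());
    for (uint32_t i = 0; i < order.size(); ++i) order[i] = i;
    std::stable_sort(order.begin(), order.end(), [&](uint32_t a, uint32_t b) {
        return hitPrimId[a] < hitPrimId[b];
    });
    return order;
}
```

SER performs an analogous grouping across the warps of a launch in hardware, using the coherence hint you pass at the reorder point, so the quality of that hint (here, the primitive/MLP ID) is what determines how much weight-fetch divergence you remove.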

If you suspect being able to reorder might help, there are ways to try it. Note these might give you new ways to think about it, but they aren't necessarily going to improve performance. One way would be to move your anyhit code into raygen, and use optixTraverse for every hit along the ray, with optixReorder in between. Unless you have a bad divergence problem, I would expect this to be slow. Another thing you might do is process hits in batches: stop and reorder, then relaunch the ray with a new tmin and start a new batch.

It’s possible that using a Blackwell might help improve perf. It might also be possible to improve perf on Ada, depending on whether you’ve achieved the best perf possible or not – optimizing cooperative vectors can be a tiny bit tricky. Even on Ada, make sure you don’t have any unnecessary shuffling, and check if you can use INT8/FP8 over FP16. You can probably estimate your tensor ops throughput and check to make sure you’re at least approaching the published specs. It might be half of the theoretical max, and that might be as good as you can get when mixing ray tracing and CUDA and tensor ops, but it shouldn’t be, for example, an order of magnitude less than theoretical max.
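That throughput estimate is simple arithmetic; a back-of-envelope sketch (all names and numbers here are hypothetical; substitute your GPU's published tensor throughput and your measured frame time):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// FLOPs for one sample through a tiny in -> hidden -> hidden -> out MLP.
// Each fused multiply-add counts as two floating-point operations; bias
// adds and activations are ignored as they are negligible next to the GEMVs.
double mlpFlopsPerSample(size_t inDim, size_t hiddenDim, size_t outDim) {
    const size_t macs = inDim * hiddenDim + hiddenDim * hiddenDim + hiddenDim * outDim;
    return 2.0 * static_cast<double>(macs);
}

// Fraction of peak achieved, given MLP samples per frame, frame time in
// seconds, and the GPU's peak in FLOP/s (e.g. the published FP16 tensor
// number for the card).
double utilization(size_t samplesPerFrame, double frameSeconds,
                   double flopsPerSample, double peakFlops) {
    const double achieved = samplesPerFrame * flopsPerSample / frameSeconds;
    return achieved / peakFlops;
}
```

For example, a 16-32-32-3 MLP costs 3264 FLOPs per sample, so one million samples in a 10 ms frame against a (made-up) 1 PFLOP/s-scale peak lands in the tens of percent; if the same arithmetic puts you at a percent or less of peak, the bottleneck is almost certainly elsewhere (divergence, memory, shuffling) rather than tensor math.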


David.