OptiX Neural Primitives Rendering with Cooperative Vectors and SER

Hi there, I am interested in efficient OptiX API usage for neural rendering in the context of neural primitives.

What would be the most efficient way to structure a scene of many (probably hundreds or even thousands of) tiny MLPs, and evaluate them in a volumetric rendering context via SER and Cooperative Vectors?

The programming guide provides intuitions for a global NeRF-like implementation, but what would be memory- and speed-efficient in a neural primitive context?

Thanks for your time.

Hi @Perukas, this is a good question.

Do you expect to be able to achieve some reasonable warp coherence in your MLP weights, for example using SER, or do you expect to have so many MLPs that each thread on average won’t be able to share weights even if you sort?


David.

Hi @dhart , thanks for the response.

The idea is to have neural primitives, so one tiny MLP per primitive. The primitives themselves could be bounded by AABBs or spheres, and at each primitive hit its MLP would be evaluated for some local sample points along the ray.

Regarding SER, from what I understood, threads that hit the same primitive could share MLP weights (but maybe I am wrong). Ideally I would achieve reasonable warp coherence for the MLP weights, of course, but I am not sure yet; I just want to try it to assess feasibility.

The thing I am stuck on is PyTorch/OptiX interop: specifically, how to structure weight and bias buffers in PyTorch for a smooth conversion via the OptiX host API and for efficient access in device code for N MLPs.

Perukas.

One way to handle this would be via the CuPy API. It allows for easy Python host-side generation of device arrays of arbitrary type, and it has a wrapper around the CUDA runtime API which you could use for lower-level array management. You might also take a look at the optixpy project, which wraps the OptiX C API in Python. It includes a few simple example codes that use CuPy to interface between Python and CUDA-C OptiX programs.

I’m wondering, if you have many tiny MLPs and need efficiency, have you considered tiny-cuda-nn instead of pytorch and interop?

Hi @Keith_Morley,

Thanks for the reply, I have already implemented python bindings for OptiX in my library. This is not the issue. What I was wondering about is the following:

The programming guide section on training with Cooperative Vectors mentions OPTIX_COOP_VEC_MATRIX_LAYOUT_TRAINING_OPTIMAL for training. However, to use it, one must call the host API to convert standard weight buffers to this opaque swizzled format. So in every optimization iteration one would convert standard fp16 buffers to TRAINING_OPTIMAL, perform MLP evaluation on device, and then convert back to propagate gradients (if I have understood the API usage correctly).

My question is: is this efficient? Maybe zero-copy access to row-major buffers, pointing directly at PyTorch device data, would be faster.

Additionally, the usage of the API for my use case is not straightforward. The programming guide gives an example for a single neural net; it is not clear whether I have to repeat all of this for hundreds of MLPs.

Thanks again for your time.

Perukas

Hi @lspano,

Thanks for the recommendation. I have been using tiny-cuda-nn (the Python bindings) for some time now. I haven't really checked, though, whether it can interop with OptiX programs; I think it is not straightforward to "glue" tiny-cuda-nn networks into OptiX.

I thought that since OptiX 9.1 provides cooperative vector features, it would be possible to implement my use case.

Thanks,

Antonis

You can use OPTIX_COOP_VEC_MATRIX_LAYOUT_TRAINING_OPTIMAL for training, i.e., for both the forward and backward passes. The author of the coop vec SDK sample pointed to this code snippet as an example:

extern "C" __global__ void __raygen__render_image()
{
    ...
    // Now compute the forward pass
    VecTIn* x_r_base = reinterpret_cast<VecTIn*>( params.base ) + p_idx;
    VecTIn x_r       = optixCoopVecLoad<VecTIn>( x_r_base );
    half* w      = params.weights.d_weights;
    half* b      = params.weights.d_biases;

    using namespace tin;
    using LinearTrainer16x32 =
        LinearTrainer<Act::RELU, Act::NONE, half, AccType,
                      OPTIX_COOP_VEC_MATRIX_LAYOUT_TRAINING_OPTIMAL,
                      INPUT_SIZE, HIDDEN_SIZE, HIDDEN_SIZE, OUTPUT_SIZE>;
    LinearTrainer16x32 trainer( w, b, nullptr, nullptr );
    VecTOut out = trainer.forward( x_r );
    writePixel( out, idx, output_w, /* offset */ dim.x * PIX_MULT );
}

Also, the conversion code can handle multiple matrices at the same time: optixCoopVecMatrixConvert takes a numNetworks argument, which can be the number of MLPs you are converting, as long as they all share the same inputNetworkDescription.
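To make that concrete, a batched convert over N identical networks implies laying both the source and destination buffers out with a uniform per-network stride. A small sizing sketch in plain C++ (the function and parameter names are made up; the "optimal" per-network byte size is opaque and must be queried from the host API at runtime, so here it is just an input):

```cpp
#include <cassert>
#include <cstddef>

// Buffer sizing for converting N identical MLPs in one batched call.
struct ConvertSizing {
    size_t inputStride;   // bytes between consecutive networks, source buffer
    size_t outputStride;  // bytes between consecutive networks, dest buffer
    size_t inputTotal;    // total source buffer size
    size_t outputTotal;   // total destination buffer size
};

ConvertSizing sizeBatchConvert(size_t rowMajorBytesPerNet,
                               size_t optimalBytesPerNet,
                               size_t numNetworks, size_t align) {
    auto alignUp = [](size_t v, size_t a) { return (v + a - 1) & ~(a - 1); };
    ConvertSizing s;
    s.inputStride  = alignUp(rowMajorBytesPerNet, align);
    s.outputStride = alignUp(optimalBytesPerNet, align);
    s.inputTotal   = s.inputStride * numNetworks;
    s.outputTotal  = s.outputStride * numNetworks;
    return s;
}
```

With this layout, one conversion call can sweep all N networks, and device code addresses network i at `base + i * stride` on either side of the conversion.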

Thank you for the snippet. I will try something similar and get back with my findings.

So it turns out that using cooperative vectors is trickier than I thought, mainly because of the (justifiable) alignment requirements.

However, I developed a toy example implementing my own linear layer forward and backward passes, and it runs pretty decently. Probably not at the frame rate I imagined on a 4090 (but maybe on a 5090 someone can achieve better performance).
I am using an anyhit-only volumetric tracer, so correct me if I'm wrong, but SER does not work for anyhit-only pipelines, right? You need to go through closesthit to achieve ray coherence.

Thanks again,

I will post more details as development progresses.


Very cool to hear it’s working!

You can use SER with a pipeline that only has anyhit shaders, but it might be hard to figure out how to make SER help. You can only reorder from a raygen program, so you can reorder before you cast the rays, and you have to figure out how sorting rays up front will make all your subsequent anyhit invocations more coherent. What is your main divergence problem: is it branching (like neighboring threads hitting different kinds of geometry and materials), or is it more about data divergence? If you have any special or non-standard volume data structures, it might be possible to sort before trace to help with coherent data access.
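As a host-side analogy of what a reorder key would accomplish here (this is not OptiX code; the hit primitive ID stands in for the MLP weight set a thread will fetch): group ray indices so that consecutive lanes shade with the same weights.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Given, for each ray, the ID of the primitive (hence the MLP weight set)
// it will shade with, produce a processing order in which rays sharing a
// weight set are adjacent. A stable sort keeps the original order within
// each group, which tends to preserve screen-space locality too.
std::vector<uint32_t> coherentRayOrder(const std::vector<uint32_t>& hitPrimId) {
    std::vector<uint32_t> order(hitPrimId.size());
    for (uint32_t i = 0; i < order.size(); ++i) order[i] = i;
    std::stable_sort(order.begin(), order.end(), [&](uint32_t a, uint32_t b) {
        return hitPrimId[a] < hitPrimId[b];
    });
    return order;
}
```

SER performs an analogous grouping across the warps of a launch in hardware, using the coherence hint you pass at the reorder point, so the quality of that hint (here, the primitive/MLP ID) is what determines how much weight-fetch divergence you remove.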

If you suspect being able to reorder might help, there are ways to try it. Note these might give you new ways to think about it, but they aren't necessarily going to improve performance. One way would be to move your anyhit code into raygen, and use optixTraverse for every hit along the ray, with optixReorder in between. Unless you have a bad divergence problem, I would expect this to be slow. Another thing you might do is process hits in batches: stop and reorder, then relaunch the ray with a new tmin and start a new batch.

It’s possible that using a Blackwell might help improve perf. It might also be possible to improve perf on Ada, depending on whether you’ve achieved the best perf possible or not – optimizing cooperative vectors can be a tiny bit tricky. Even on Ada, make sure you don’t have any unnecessary shuffling, and check if you can use INT8/FP8 over FP16. You can probably estimate your tensor ops throughput and check to make sure you’re at least approaching the published specs. It might be half of the theoretical max, and that might be as good as you can get when mixing ray tracing and CUDA and tensor ops, but it shouldn’t be, for example, an order of magnitude less than theoretical max.
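That throughput estimate is simple arithmetic; a back-of-envelope sketch (all names and numbers here are hypothetical; substitute your GPU's published tensor throughput and your measured frame time):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// FLOPs for one sample through a tiny in -> hidden -> hidden -> out MLP.
// Each fused multiply-add counts as two floating-point operations; bias
// adds and activations are ignored as they are negligible next to the GEMVs.
double mlpFlopsPerSample(size_t inDim, size_t hiddenDim, size_t outDim) {
    const size_t macs = inDim * hiddenDim + hiddenDim * hiddenDim + hiddenDim * outDim;
    return 2.0 * static_cast<double>(macs);
}

// Fraction of peak achieved, given MLP samples per frame, frame time in
// seconds, and the GPU's peak in FLOP/s (e.g. the published FP16 tensor
// number for the card).
double utilization(size_t samplesPerFrame, double frameSeconds,
                   double flopsPerSample, double peakFlops) {
    const double achieved = samplesPerFrame * flopsPerSample / frameSeconds;
    return achieved / peakFlops;
}
```

For example, a 16-32-32-3 MLP costs 3264 FLOPs per sample, so one million samples in a 10 ms frame against a (made-up) 1 PFLOP/s-scale peak lands in the tens of percent; if the same arithmetic puts you at a percent or less of peak, the bottleneck is almost certainly elsewhere (divergence, memory, shuffling) rather than tensor math.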


David.