Is it possible to call optixTrace from custom intersection?

Hi OptiX team! My deepest respect to all of you for the OptiX ray tracing system; it is an amazing product! I worked with it briefly 10 years ago, and I can't believe how great it is now!
Now it's time for me to use OptiX from the other side! :)

I need to build a dynamic acceleration tree with 2 levels of IASes and a bottom level of GASes. This is for noise-based object instancing with a procedural distribution on top of a surface and per-object randomization. The AABBs are not known in advance.
At each transition from the top-level IAS to a bottom-level IAS I am going to inject a custom user primitive with the functionality I need: it will perform some magic, dynamically select one of the available OptixTraversableHandles, and launch optixTrace within it. Based on the result of the inner optixTrace, the ray will call reportIntersection (or not) for the higher-level optixTrace. The ray itself will not change direction or origin, although I may do some clipping.
So the nesting depth of optixTrace calls is 3. Only at the very bottom will it use the full RTX processing with a BVH and triangles, but this bottom-most level is a rather small amount of data.
Will this construction be efficient? How large is the stack/variable cost for each nested optixTrace? Is it good practice? I haven't tested it yet. Originally I wanted to use native OptiX multi-level IAS traversal, but I didn't find a way to inject my own bits of computation at the points where the ray transitions from one IAS to a child IAS.

Best regards,
Kirill

Answering your topic’s title question, it’s not possible to call optixTrace() from the intersection program domain.
This table of the OptiX device-side functions inside the programming guide describes which functions can be called in which program domain:
https://raytracing-docs.nvidia.com/optix7/guide/index.html#device_side_functions#device-side-functions

There is no way to dynamically change the traversal through a hierarchy of IAS. OptiX removed support for Selector nodes with OptiX 6.

I’m not sure how you would handle any acceleration structure when the AABBs are not known or why exactly you need multiple levels from your description.

Some draft thinking:
If each IAS contains a non-identity transform, then shooting a ray against a proxy geometry would allow you to determine inside the anyhit or closesthit program which instance inside the top-level IAS you have hit, and you could return the traversable handle of the child of that instance.

In case you need all hits along a ray, the potential hits would either need to be stored by an anyhit program into an array of limited size, or you would need to iterate over the top-level hierarchy until there is no closest hit and do the following on each closest-hit invocation. The latter would require a small ray stack and a local root traversable handle.

Means you’re getting back into the ray generation program and can shoot the next ray with the child’s traversable.
Note that this would need to take the different world space into account which is defined by the parent instance transformation above it. You would need to transform the ray into that space to maintain a consistent transformation hierarchy.

Repeat until you’ve reached the bottom geometry acceleration structure.

That would be an iterative approach because there would only be one optixTrace call needed inside the ray generation program.
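Here is a minimal device-side sketch of that iterative scheme, just to illustrate the idea. The payload convention (p0/p1 carrying the next child traversable handle, 0 meaning stop), the ProxySbtData record, and the hard-coded primary ray are my assumptions for the sketch, not existing OptiX names:

    #include <optix.h>

    struct Params { OptixTraversableHandle topHandle; };
    extern "C" __constant__ Params params;

    struct ProxySbtData { OptixTraversableHandle childHandle; }; // hypothetical per-instance SBT record

    extern "C" __global__ void __closesthit__proxy()
    {
        // Report the child traversable of the hit proxy instance back to raygen.
        const ProxySbtData* data = reinterpret_cast<const ProxySbtData*>(optixGetSbtDataPointer());
        optixSetPayload_0(static_cast<unsigned int>(data->childHandle));
        optixSetPayload_1(static_cast<unsigned int>(data->childHandle >> 32));
    }

    extern "C" __global__ void __miss__proxy()
    {
        optixSetPayload_0(0u); // no further child: terminate the loop
        optixSetPayload_1(0u);
    }

    extern "C" __global__ void __raygen__iterative()
    {
        float3 origin = make_float3(0.0f, 0.0f, -10.0f); // placeholder primary ray
        float3 dir    = make_float3(0.0f, 0.0f, 1.0f);

        OptixTraversableHandle handle = params.topHandle;

        // One optixTrace call per hierarchy level: trace, read back the child
        // handle, repeat until a miss or the bottom GAS has been handled.
        while (handle != 0)
        {
            unsigned int p0 = 0, p1 = 0;
            optixTrace(handle, origin, dir, 0.0f, 1e16f, 0.0f,
                       OptixVisibilityMask(255), OPTIX_RAY_FLAG_NONE,
                       0u, 1u, 0u, // SBT offset, stride, miss index
                       p0, p1);
            handle = (static_cast<OptixTraversableHandle>(p1) << 32) | p0;
            // As noted above, origin/dir would also need to be transformed
            // into the child's space using the parent instance transform.
        }
    }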

That would also mean that each level of your hierarchy could be stored as a single-level hierarchy, and if the proxy geometry for the instances is built with triangles, the whole traversal would be fully handled by the hardware on RTX boards.

Still, the crucial part is that you must know some safe AABB limits around the bottom-level geometry, otherwise there is no way to construct any proxy geometry for the instances above it. If you cannot determine that closely enough, that would be similar to having no acceleration structure at all, which is not possible in OptiX 7 and would be slow since each ray would need to test everything.

Hi Detlef!
Thanks for your quick reply!
I am a bit surprised there is no way to traverse the tree with some transition points where the application may do some work of its own, e.g. change the traversable handle on the fly and continue deeper.
Of course the AABBs of the GASes are known; they are close to the tightest AABBs, and an IAS can be built on top of that, or a GAS on top of custom primitives.

Ok, so can we return the traversable of the hit point in this case, and also get the traversal stack from that hit point and then reuse it if needed to re-launch optixTrace from the interrupted place in the accel tree? I need to find the closest hit, but the structure is complex. Enumerating the list of potentially intersected proxies with anyhit, collecting them in an array, and working with them in raygen would not be good: too much data for cases with high depth complexity, overflow issues, and there is probably some optixTrace launch overhead.

The easiest approach would be to have a custom intersection program with some computations and dive into another subtree from it.

One workaround could be CUDA traversal code for the top levels inside raygen and optixTrace inside the bottom level. This is a solution with potentially reduced speedup gains.

Another question: how can I reduce the size of a triangle GAS? I get something like 95-145 bytes per triangle, with the lower number of 95 bytes after compaction. That seems too high! Is there any chance to get something like 40-50 bytes per triangle in a GAS?

I am a bit surprised there is no way to traverse the tree with some transition points where the application may do some work of its own, e.g. change the traversable handle on the fly and continue deeper.

For performance reasons it’s simply not possible to interrupt the BVH traversal at IAS level.

Ok, so can we return the traversable of the hit point in this case, and also get the traversal stack from that hit point and then reuse it if needed to re-launch optixTrace from the interrupted place in the accel tree?

That would only be possible with the proxy-geometry method I described, where you'd need multiple IAS → proxy GAS levels (e.g. the AABBs of the children as box primitives constructed from triangles).

Those instances would need to store all the information you require to select the desired child traversable. This could be held in a shader binding table record per instance.
Doing that in a single launch would mean that the shader binding table needs to store this information for all instances on all levels.

You would be responsible for storing the required data in a way you can access as needed. The OptixInstance struct has two user-defined fields, instanceId and sbtOffset, which can be used to control which SBT record is accessed and for other user-defined data indexing tricks.
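A small host-side sketch of such an instance setup; childIndex and proxyGasHandle are hypothetical values coming from your own scene setup:

    #include <optix.h>
    #include <cstring>

    OptixInstance makeProxyInstance(unsigned int childIndex, OptixTraversableHandle proxyGasHandle)
    {
        OptixInstance instance = {};
        const float identity[12] = { 1.0f, 0.0f, 0.0f, 0.0f,
                                     0.0f, 1.0f, 0.0f, 0.0f,
                                     0.0f, 0.0f, 1.0f, 0.0f };
        std::memcpy(instance.transform, identity, sizeof(identity));
        instance.instanceId        = childIndex;     // free user field for your own indexing tricks
        instance.sbtOffset         = childIndex;     // selects the SBT hit record holding the child data
        instance.visibilityMask    = 255;
        instance.flags             = OPTIX_INSTANCE_FLAG_NONE;
        instance.traversableHandle = proxyGasHandle; // the proxy GAS standing in for the child
        return instance;
    }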

Mind, I have not tried this. It’s just an idea to architect what you are looking for with existing OptiX 7 methods while trying to keep things fully hardware accelerated on RTX boards (not using custom primitives for the proxy geometry).

The takeaway is that once you hit a primitive in a GAS and have reached the hit programs, you have all the information about the transformation hierarchy above it.
Since you want that at each IAS level and want to change the traversed hierarchy, I see no other way than staging the traversal into multiple single-level IAS → GAS steps.

When there is only one IAS level, that is pretty easy to retrieve.
Special case example code here: https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/rtigo3/shaders/closesthit.cu#L46
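As a rough sketch of that special case (a single IAS above the GAS), the instance matrices can be read in a hit program via the transform list:

    extern "C" __global__ void __closesthit__example()
    {
        // With exactly one IAS level the transform list contains a single entry.
        const OptixTraversableHandle handle = optixGetTransformListHandle(0);

        // Rows of the 3x4 instance matrices of the hit instance.
        const float4* objectToWorld = optixGetInstanceTransformFromHandle(handle);
        const float4* worldToObject = optixGetInstanceInverseTransformFromHandle(handle);
        // ... transform hit attributes between object and world space as needed ...
    }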

But it’s getting rather complicated when there are motion transforms involved but OptiX provides helper functions which show how to iterate over the transform hierarchy and to concatenate the final matrices. Defined in OptiX SDK 7.2.0\include\internal\optix_7_device_impl.h. resp. implemented in optix_7_device_impl_transformations.h
Example usage here: https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/intro_motion_blur/shaders/closesthit.cu#L72

Which GPU are you using?

The compaction efficiency depends on the GPU and the number and spatial distribution of the primitives.

The nvlink_shared example in my OptiX 7 applications on https://github.com/NVIDIA/OptiX_Apps (also works on single GPU) is printing the size in bytes for the vertex attributes, triangle indices, and the compacted GAS of models here:
https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/nvlink_shared/src/Device.cpp#L1290

That means in that code, calculating accelBufferSizes.outputSizeInBytes / (indicesSizeInBytes / 4 / 3) gives the bytes per triangle.
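Spelled out, using the variable names from that code (3 indices of 4 bytes each per triangle):

    const size_t numTriangles     = indicesSizeInBytes / (3 * sizeof(unsigned int));
    const size_t bytesPerTriangle = accelBufferSizes.outputSizeInBytes / numTriangles;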
For a triangulated sphere or plane mesh I get values of 28 and 40 bytes per triangle on Turing and Ampere boards.
Maybe compare that on your setup.

GAS Size:
I have tested on a GTX 1070. Not an RTX board, but anyway.
In the example optixHead/Head.cpp I print this:

    printf("\n GAS: numTriangles %u / numVertices %u bytesPerTriangle %zu outputSizeInBytes %zu tempSizeInBytes %zu\n",
            buildInput.triangleArray.numIndexTriplets,
            buildInput.triangleArray.numVertices,
            bufferSizes.outputSizeInBytes / buildInput.triangleArray.numIndexTriplets,
            bufferSizes.outputSizeInBytes,
            bufferSizes.tempSizeInBytes);

It outputs this:
GAS: numTriangles 60880 / numVertices 31326 bytesPerTriangle 145 outputSizeInBytes 8865408 tempSizeInBytes 317312
145 is too much; 28 would be absolutely fine for a BVH plus an indexed triangle array. I will run your apps.
Btw, there is a small bug in the stats of this sample, where it prints:
Head:
Number of vertices: 31326
Number of triangles: 182640
but in reality it prints not the number of triangles, but the number of all indices, which is 3x more.

That explains it and the results are reasonable if the GTX board gets below 100 bytes per triangle after compaction.

RTX boards will compact much better, normally resulting in less than half of the initial AS build’s temporary buffer size.

It seems that on GTX you store all 3 vertices of a triangle per leaf in some Woop-style layout, probably 1 triangle per leaf plus a couple of splits and full-precision AABBs :) Are there any flags to reduce that and the compaction size? After compaction there are 95 bytes per triangle.
So, what are the usual bytes per triangle on RTX boards like the 3080? I haven't tested them yet.

A few other questions:

  1. Anyhit calls are ordered as the BVH traversal proceeds, where the ray selects the closest child based on the minimum distance from the ray origin along the direction. This way we can enumerate all leaves starting from approximately the closest one, and so on. Right?
  2. I have a rough idea to place a fake AABB element inside the IAS organisation of normal elements. This fake AABB has large enough extents (slightly larger than the whole IAS AABB) and is always the first in the sequence of anyhit calls. In the shader for this AABB I perform some computations, store them in a payload, and call ignoreIntersection. All further anyhits are empty; yes, I have heard they are less efficient, but this is not for all IAS levels.
  3. So what can we do with this? I hope to redirect rays to different paths based on instance/ray visibility bits. Is it possible to change the ray visibility mask on the fly? There are just 8 bits, but I am exploring ideas.
    Is it possible to change the ray tmin and tmax on the fly during traversal?
  4. Is it possible to update only the visibility masks of an existing compacted IAS accel structure in a way as fast as a CUDA kernel like
visibility[i] &= ~(1u << old_bit); visibility[i] |= (1u << new_bit);

without touching the AABBs or even the traversable handles? Btw, will ray traversal work if a traversable handle has an invalid value?
  5. Does OptiX propagate instance visibility masks hierarchically from child to parent? I mean a case where two children have a zero bit at the same position, so their parent would get a zero bit in that position too. Then during traversal the ray would cull the intersections at the parent level, not at its children's level.

Sorry for so many questions :). My head is completely full of OptiX thoughts and ideas for some complex things.

Hi Kirill, I’ll jump in to help a little while Detlef is sleeping…

Are there any flags to reduce that and the compaction size?

Not right now. Compaction doesn’t have any controls, and the final compacted size depends on the GPU architecture.

  1. Anyhit calls are ordered as the BVH traversal proceeds, where the ray selects the closest child based on the minimum distance from the ray origin along the direction. This way we can enumerate all leaves starting from approximately the closest one, and so on. Right?

You should assume there are no guarantees on anyhit traversal order. Anyhit can and will be called out of t-order in practice, and the hit points of successive calls can be surprisingly far apart. Let go of any notions of leveraging locality of hits in anyhit calls… ;)

Is it possible to change the ray visibility mask on the fly?

Not currently. You can launch rays with different masks, but the mask cannot be changed for a given ray after optixTrace() has been called.

Is it possible to change the ray tmin and tmax on the fly during traversal?

tmax is updated automatically with the intersection results: it is clipped to the minimum hit distance seen so far, which prevents further traversal of anything farther away. tmax can't be increased during traversal, and tmin is not updated during traversal.

Btw, will ray traversal work if a traversable handle has an invalid value?

No, using an invalid value will result in a crash.

Does OptiX propagate instance visibility masks hierarchically from child to parent?

OptiX does not update the visibility masks of any instance other than the ones you set; each instance mask you set is static regardless of hierarchy. If a parent is culled during traversal, then none of its children will be traversed, regardless of what is in their visibility masks.

I feel like I brought a lot of bad news, but keep in mind the flexibility is there: you can do most of the things you're asking about across multiple iterative or recursive calls to optixTrace(), just obviously at the cost of relaunching the ray.


David.

Hi David! Bad news is fine; I am looking for the most efficient way possible. Right now there is the option to call optixTrace a few times with different tmin/tmax intervals that divide the space and check different visibility masks. So, e.g., 4 simpler, reduced optixTrace calls can replace the 1 complex optixTrace I was looking for.
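As a raygen-level sketch of that slab idea; sceneExtent, the per-slab masks, origin/dir, and the p0 "hit found" payload flag are all hypothetical, user-defined pieces:

    const int          numSlabs   = 4;
    const float        slabLength = sceneExtent / numSlabs;
    const unsigned int masks[4]   = { 1u, 2u, 4u, 8u };

    unsigned int p0 = 0; // set to nonzero by the closest-hit program on a hit
    for (int i = 0; i < numSlabs && p0 == 0; ++i)
    {
        optixTrace(params.topHandle, origin, dir,
                   i * slabLength, (i + 1) * slabLength, // disjoint [tmin, tmax] slabs
                   0.0f, OptixVisibilityMask(masks[i]), OPTIX_RAY_FLAG_NONE,
                   0u, 1u, 0u,
                   p0);
    }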
Let’s return to my question #4:

Another question. I have noticed that optixAccelBuild() produces just one GAS and one OptixTraversableHandle that can be used as a reference for one element of an IAS. optixAccelBuild() can build many build inputs in parallel, but they are all combined into one GAS. How can one build many GASes in parallel when each GAS is too small to saturate the GPU? There are streams, but they only scale to fairly small numbers, not thousands. I would like the possibility to produce a lot of GASes in parallel, thousands or even more. Is there any way?

Is it possible to update only the visibility masks of an existing compacted IAS accel structure in a way as fast as a CUDA kernel […] without touching the AABBs or even the traversable handles

Oh, sorry I missed answering that one. Currently it's not quite as simple as touching the mask values in memory: you will need to call optixAccelBuild() with the updated information. I believe you can use the OPTIX_BUILD_OPERATION_UPDATE operation, which is much faster than a BVH build from scratch. It may be possible to create a fast path for updating only the visibility masks; I'm not sure anyone has asked for that before. Normally instance masks are static and people only change the ray's visibility mask on the fly. That can be done per ray, at the optixTrace() call level, so the ray mask can change as quickly as you can trace rays, and you can use different masks for different ray depths, different materials, etc.
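A host-side sketch of such a refit, assuming the IAS was originally built with OPTIX_BUILD_FLAG_ALLOW_UPDATE and the device-side OptixInstance array has already been rewritten with the new visibilityMask values (OPTIX_CHECK being the usual SDK error-check macro):

    OptixAccelBuildOptions options = {};
    options.buildFlags = OPTIX_BUILD_FLAG_ALLOW_UPDATE;
    options.operation  = OPTIX_BUILD_OPERATION_UPDATE; // refit instead of a full rebuild

    OPTIX_CHECK(optixAccelBuild(context, stream, &options,
                                &instanceBuildInput, 1,
                                d_tempUpdate, bufferSizes.tempUpdateSizeInBytes,
                                d_iasOutput, bufferSizes.outputSizeInBytes,
                                &iasHandle,
                                nullptr, 0)); // no emitted properties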

Another question. I have noticed that optixAccelBuild() produces just one GAS and one OptixTraversableHandle that can be used as a reference for one element of an IAS. optixAccelBuild() can build many build inputs in parallel, but they are all combined into one GAS. How can one build many GASes in parallel when each GAS is too small to saturate the GPU? There are streams, but they only scale to fairly small numbers, not thousands. I would like the possibility to produce a lot of GASes in parallel, thousands or even more. Is there any way?

This is a good question. Streams are the primary way to build multiple GASes in parallel, outside of the multiple-build-inputs functionality. A couple of things to note: you'll find that GAS builds have some overhead outside of the per-primitive costs, so there will always be benefits to combining things into a smaller number of GAS builds if/when possible. From what I've heard, I believe you will get the maximum parallel BVH build performance with a relatively low number of streams, like 4-ish. The most common problem when people build many small BVHs is memory allocation, not the build itself. If you're going to build a lot of small BVHs, then make sure to batch your memory allocations for the output buffers together, batch your temp buffer allocations together (use a ring buffer or something), and make sure you're not calling cudaFree() in the middle of your parallel builds, etc. Aside from that, the best advice I can give for the highest-performance BVH builds and saturating the GPU is to use fewer, larger BVHs if you can.
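A rough sketch of that batching advice; maxTempSize is assumed to be the largest tempSizeInBytes reported by optixAccelComputeMemoryUsage() over all inputs, and the output buffers are assumed to have been carved out of one big allocation up front:

    CUdeviceptr d_tempPool = 0;
    cudaMalloc(reinterpret_cast<void**>(&d_tempPool), numStreams * maxTempSize);

    for (size_t i = 0; i < numBuilds; ++i)
    {
        const int s = static_cast<int>(i % numStreams);
        // Builds in the same stream serialize, so each stream can safely reuse
        // its own slice of the temp pool; no cudaMalloc/cudaFree in the loop.
        OPTIX_CHECK(optixAccelBuild(context, streams[s], &buildOptions,
                                    &buildInputs[i], 1,
                                    d_tempPool + s * maxTempSize, maxTempSize,
                                    d_outputs[i], outputSizes[i],
                                    &gasHandles[i], nullptr, 0));
    }
    cudaDeviceSynchronize(); // or per-stream events, before using the handles
    cudaFree(reinterpret_cast<void*>(d_tempPool));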


David.

In my case cudaMalloc/cudaFree is not a penalty, because I have made a memory manager that allocates chunks of 128 MB on the device and fills them with my own simpler, lighter allocator as needed. For example, I fill them with blocks of geometry data; only when the renderer is closed is that data cudaFree'd. I have made sure that each request is smaller than 128 MB. Another case is larger arrays (like rays in a multi-kernel approach): roughly 1 GB of data cudaMalloc'ed once, from which all temporary CUDA arrays are later allocated by incrementing a pointer counter and freed in reverse order in the same scope. Things are more complex internally, but the basic idea is like this.
Streams indeed accelerate things up to about 4 streams, sometimes marginally up to 8 streams for simple operations, but I haven't tested a very complex pipeline of many kernels such as a BVH builder. I understand that in principle we should combine as much data as possible into a single GAS, but that is a performance optimization. Sometimes the situation calls for building 1000 GASes, each with ~1000 elements. The GPU will work well for the total of 1M elements, but not well for many small launches of a few thousand elements each, where hundreds of kernel launch overheads will have an impact.
Ok, is there any way to reference a single build input of a GAS, rather than the whole GAS, from the elements organised under an IAS?

In my case cudaMalloc/cudaFree is not a penalty, because I have made a memory manager that allocates chunks of 128 MB on the device and fills them with my own simpler, lighter allocator as needed.

Excellent!

Ok, is there any way to reference a single build input of a GAS, rather than the whole GAS, from the elements organised under an IAS?

I’m not sure I understand the question, but maybe you’re looking for one of the two functions optixGetInstanceId() or optixGetInstanceIndex()? https://raytracing-docs.nvidia.com/optix7/guide/index.html#device_side_functions#device-side-functions


David.

No, currently each element that is organized under an IAS is a reference to a whole GAS (a GAS handle):
optix_instances[0].traversableHandle = state.triangle_gas_handle;
It uses it’s whole AABB. A GAS may contain a lot of build_inputs (different sets of triangles).
What I want to do is to reference a build input (and it’s smaller AABB) from each IAS elements. For example, we build 1 GAS with 1000 build inputs in one single call optixBuildAccel. And for top-level IAS we create 1000 elements each referencing the AABB and some handle of each of build inputs of GAS. Seems to be this is not possible too but I could miss something.

Oh, I see. Right, this isn't possible in the API right now. There is no explicit smaller AABB or sub-instance that is created from a build input. One of the reasons to put multiple build inputs into a single GAS is to prevent AABB overlap between the separate build inputs: a BVH is built from the primitive soup of all the combined build inputs, so that you get the optimal bounding box hierarchy.


David.