Best way to turn entities ON/OFF during ray tracing in Optix

There is a requirement to be able to turn the visibility of entities on a specific layer on/off during ray tracing (something like what a CAD program does). Entities on one layer do not necessarily all belong to one single mesh. For example, all room ceiling triangles are in the “ceiling” layer, all floor surfaces are in the “floor” layer, and so on. The buildings are scattered all around the model, so there is a high chance that the ASs (if one AS is created per layer) overlap each other.

Initially, setting the OptixInstance visibilityMask seemed to me the right approach. However, I am not certain about this and wonder what the best approach is to achieve it.

Specifically, I am concerned about the BVH structures heavily overlapping and the resulting performance issues; as I said, the ceiling, floor, and wall triangles each belong to different ASs, yet they are close together and scattered all around the model.

Thanks for any thought on this.

Visibility masks are the highest performance mechanism for creating different layers you can turn on and off dynamically. With an RTX GPU, there is hardware support for visibility masks. They are specified at the instance level (or Group/GeometryGroup in OptiX 6 terminology), so traversal is terminated early when the ray’s visibility mask doesn’t match the instance visibility mask. This means that overlapping instances won’t normally be a problem, because the ray doesn’t traverse the associated GAS.

Just be aware that you only get a limited number of bits for your visibility mask, and you’ll need one bit per layer. In OptiX 7, your code should query OPTIX_DEVICE_PROPERTY_LIMIT_NUM_BITS_INSTANCE_VISIBILITY_MASK. On Turing and Ampere there are 8 bits.
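As a rough illustration of the masking logic described above, here is a small CPU-side C++ sketch: an instance is only traversed when the bitwise AND of the ray's mask and the instance's mask is non-zero. The names `LayerMask`, `layerBit`, `setLayerVisible` and `instanceIsTraversed` are hypothetical helpers and not part of the OptiX API, and `kNumMaskBits = 8` is an assumption matching the Turing/Ampere limit mentioned above; real code should query the limit with optixDeviceContextGetProperty().

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 8 mask bits, as on Turing and Ampere. Real code should query
// OPTIX_DEVICE_PROPERTY_LIMIT_NUM_BITS_INSTANCE_VISIBILITY_MASK instead.
constexpr unsigned kNumMaskBits = 8;

using LayerMask = std::uint32_t;  // only the low kNumMaskBits bits are used

// One bit per layer: layer 0 -> 0x01, layer 1 -> 0x02, and so on.
inline LayerMask layerBit(unsigned layer) {
    assert(layer < kNumMaskBits);
    return LayerMask{1} << layer;
}

// Returns the ray mask with the given layer switched on or off.
inline LayerMask setLayerVisible(LayerMask rayMask, unsigned layer, bool visible) {
    return visible ? (rayMask | layerBit(layer)) : (rayMask & ~layerBit(layer));
}

// CPU stand-in for the per-instance test: the instance (and the GAS below it)
// is only visited if the two masks share at least one set bit.
inline bool instanceIsTraversed(LayerMask rayMask, LayerMask instanceMask) {
    return (rayMask & instanceMask) != 0;
}
```

In an actual OptiX 7 application the instance-side value would go into OptixInstance::visibilityMask and the ray-side value into the visibilityMask argument of optixTrace(), so toggling a layer only changes the ray mask and needs no AS rebuild.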

If you need more than 8 layers, you can get layer toggling functionality by using anyhit + optixIgnoreIntersection(), or by building multiple different combinations of your top level acceleration structures. The anyhit solution is not as fast as a visibility mask, but can be functionally equivalent. Building multiple top level acceleration structures will obviously consume some extra memory. You might be able to statically or dynamically manage your layers to use visibility masks for the most commonly used layers, and fall back to one of the other methods for your less commonly used layers.


David.

Thanks David for your prompt reply. I will implement this and will come back here if I have more questions. Meanwhile I have one question:

So in the visibility mask approach I will have one instance (and one GAS) per layer?
And one top-level IAS that will have all the instances, correct?

I wonder how different this approach is compared to what you mentioned:
“…or by building multiple different combinations of your top level acceleration structures.”

So in the visibility mask approach I will have one instance (and one GAS) per layer?

That’s really up to you, you can organize your layers however you like. You can assign any mask (layer) value to any instance, so if you want to have a lot of separate instances in a layer you can, or if you prefer to put everything in a layer into a single GAS, that’s perfectly fine too.

And one top-level IAS that will have all the instances, correct?

This is also a choice, and not a limit. One top level is normal, but you can freely use a deeper hierarchy with visibility masks if you like.

I wonder how different this approach is compared to what you mentioned:
“…or by building multiple different combinations of your top level acceleration structures.”

At a really basic abstract level, you might imagine that if you had 10 layers that can toggle visibility, there are 2^10, or 1024 possible combinations of visible layers. That’s a lot of top-level ASes to build, so you’d probably want to build them on-demand and maybe cache them. So the first time you toggle visibility maybe you have to wait for a top-level build, but toggling again would be instantaneous.

If you have 8 visibility mask bits to use, then you could assign them to your most commonly used 8 layers, and then you’d only need to build combinations for the other 2 layers, meaning you’d only need 2^(10-8), or a total of 4 combinations of top-level builds. By using visibility masks, you get to reduce the combinations by a lot.
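The arithmetic above can be written down as a tiny helper. This is just a sketch of the counting argument; the function name `topLevelCombinations` is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Number of distinct top-level AS builds needed when `maskBits` of the
// `numLayers` toggleable layers are handled by visibility mask bits and each
// remaining layer doubles the number of visible/hidden combinations.
inline std::uint64_t topLevelCombinations(unsigned numLayers, unsigned maskBits) {
    const unsigned remaining = (numLayers > maskBits) ? (numLayers - maskBits) : 0;
    assert(remaining < 64 && "count does not fit into 64 bits");
    return std::uint64_t{1} << remaining;
}
```

With 10 layers this gives 1024 builds without masks and 4 builds when 8 layers are covered by mask bits, which is why caching on-demand builds only for the leftover layers keeps the count manageable.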

If you have a very high number of layers, then I would expect building the different combinations to be impractical, and in that case you’d probably want to resort to using the anyhit shader. No reason you can’t combine all three techniques, other than I imagine it will probably take a lot of engineering effort to support.


David.

If the whole scene fits into memory on your system (which it must anyway when all layers are active), and the objects of each layer are built into their own IAS->GAS subtree, then you would have as many OptixTraversableHandle values as there are layers.

Building a top-level IAS with only the active layers’ instances on every visibility toggle should be reasonably fast.

That means your final render graph hierarchy would be three ASes deep:
IAS (one instance per layer) -> IAS (one or more instances per set of objects) -> GAS (geometry of the unique objects)

(No real need for the visibility mask method then, but as David explained these two can be used together to reduce the number of required top-level IAS rebuilds for the most common layer switches.)

There also shouldn’t be a need to change the shader binding table if that is built for the “all layers active” case.
Your instance IDs and SBT offsets wouldn’t change when adding and removing instances from the top-level IAS.
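A minimal sketch of that idea on the host side, assuming one top-level instance per layer: the full instance list is set up once for the “all layers active” case, and each toggle only filters it before the top-level IAS rebuild. `LayerInstance` and `selectActiveInstances` are hypothetical names; in a real application the entries would be OptixInstance structs referencing the per-layer traversable handles.

```cpp
#include <vector>

// Hypothetical stand-in for the instance fields that matter for this sketch.
struct LayerInstance {
    unsigned instanceId;  // assigned once, unchanged across rebuilds
    unsigned sbtOffset;   // assigned once, unchanged across rebuilds
    unsigned layer;       // index of the layer this instance represents
};

// Returns the instances to feed into the next top-level IAS build. Instances
// of hidden layers are simply left out, so they are not reachable during
// traversal; the kept entries are copied unmodified.
inline std::vector<LayerInstance> selectActiveInstances(
    const std::vector<LayerInstance>& all, const std::vector<bool>& layerActive) {
    std::vector<LayerInstance> active;
    for (const LayerInstance& inst : all) {
        if (inst.layer < layerActive.size() && layerActive[inst.layer]) {
            active.push_back(inst);
        }
    }
    return active;
}
```

Because the IDs and SBT offsets are copied through untouched, the shader binding table built for the full scene stays valid for any subset.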

That would esp. not incur any rendering overhead for hidden layers because the objects on the deactivated ones wouldn’t be inside the reachable traversal hierarchy.

I have used the anyhit + optixIgnoreIntersection() method in the past and performance suffers most with that. I wouldn’t do that, esp. not when most of your materials do not require an anyhit program otherwise.

(Sidenote: This would also be the way you could implement a configurator, like selecting different wheels on a car.
The difference is that there would be different geometries for the same object, but only one of them would be visible at any time. This implies that the sum of all objects inside a configurator setup could be bigger than would fit into the renderer, if the required GAS are built dynamically.
Most often things will fit, and then the method is the same; the only difference is that the layer visibility above behaves like a DIP switch and the configurator method like a radio button.)

Thank you both for the good answers.

I implemented the visibility mask concept and it works fine and I am able to toggle between layers visibility. That is cool. Thanks.

However, the fps dropped to about half. So I am wondering what could cause the performance hit?
The hierarchy (from bottom to top) now looks like this :

  • one GAS (with one geometry) of all triangles for a layer
  • an OptixInstance (with the above GAS traversableHandle) for each layer
  • a root IAS containing all the instances

IAS (one instance per layer) -> GAS (geometry of the unique objects)

So I guess that, compared to your suggestion, I am missing the middle IAS? But as there is only one GAS per instance, I don’t see the point of having an IAS for it. Is that correct, or do I need the middle IAS? Could this cause the performance hit?

Previously I had one GAS (with one geometry mesh containing all vertices and triangle indices of the scene) and this inside the top root IAS. The fps for this was almost twice as high. In the new approach I have split the main vertex buffer and triangle indices into multiple buffers for the different layers. That is, each GAS now takes its own vertex and triangle index buffer, so the total number of vertices could potentially be higher than in the first scenario.
For example, in the image below, the scene triangles are grouped in three layers colored in red, green and blue. Each layer is a GAS inside an OptixInstance, so each GAS would have its own vertex array.
Can I not have one vertex buffer for the whole model and share it among multiple GASs? Or is this not a bottleneck anyway?

Another thing I am suspicious of is, again, the overlapping instances. Since the layers’ triangles are scattered all around, the instance bounding boxes will overlap. As the image below shows, the bounding boxes of these three instances heavily overlap, and a ray in almost any direction has to perform more ray-instance bounding box tests. Is this correct? And could this be the cause of the performance hit?

image

However, the fps dropped to about half. So I am wondering what could cause the performance hit?

When speaking about performance, please provide absolute numbers and the system configuration.
In case this drops from 60 fps to 30 fps this could be as simple as having VSync enabled for the final display mechanism.

IAS (one instance per layer) -> GAS (geometry of the unique objects)
So I guess that, compared to your suggestion, I am missing the middle IAS?
But as there is only one GAS per instance, I don’t see the point of having an IAS for it.
Is that correct, or do I need the middle IAS?

If you had a single level instancing mechanism before and all geometry per layer in one GAS and use the visibility mask method to encode a maximum of eight layers, there is no need for an additional IAS level.

Could this cause the performance hit?

I wouldn’t expect a performance reduction when using the same hierarchy with and without visibility masks when all layers are enabled.

Previously I had one GAS (with one geometry mesh containing all vertices and triangle indices of the scene) and this inside the top root IAS.

If everything was in one GAS, why do you need an IAS on top?
You did not use OPTIX_TRAVERSABLE_GRAPH_FLAG_ALLOW_SINGLE_GAS?

Ok, that could actually be a difference, but a factor of two sounds rather high.
The OPTIX_TRAVERSABLE_GRAPH_FLAG_ALLOW_SINGLE_LEVEL_INSTANCING hierarchy is fully hardware accelerated on RTX boards.

What is your system configuration?
(OS version, installed GPUs, display driver version, OptiX version (major.minor.micro), CUDA version, host compiler version.)

The fps for this was almost twice as high. In the new approach I have split the main vertex buffer and triangle indices into multiple buffers for the different layers. That is, each GAS now takes its own vertex and triangle index buffer, so the total number of vertices could potentially be higher than in the first scenario.

Ok, you’re saying that you reused vertex information in the single GAS hierarchy, while you duplicated some shared vertices in the single level instancing case for the layering.

For example, in the image below, the scene triangles are grouped in three layers colored in red, green and blue. Each layer is a GAS inside an OptixInstance, so each GAS would have its own vertex array.

Can I not have one vertex buffer for the whole model and share it among multiple GASs?

Memory management is your responsibility.
You could put all vertices in one buffer if you want and have the individual GAS be built by using the respective primitive indices.
I see no problem with that, other than that if the scene becomes really large, this relies on CUDA finding a sufficiently large contiguous memory block to allocate the data.

Or is this not a bottleneck anyway?

Unlikely, unless the individual objects are built of rather few primitives and the previous vertex sharing was highly effective.
Do you use acceleration structure compaction?

Another thing I am suspicious of is, again, the overlapping instances. Since the layers’ triangles are scattered all around, the instance bounding boxes will overlap. As the image below shows, the bounding boxes of these three instances heavily overlap, and a ray in almost any direction has to perform more ray-instance bounding box tests. Is this correct? And could this be the cause of the performance hit?

Yes, this is going to behave worse compared to a single GAS since that can be better optimized spatially, while the IAS AABBs overlap and would basically all be checked in your current setup.
This is also happening when engines sort their geometries by material which isn’t the best idea for a spatial acceleration structure.

You could, for example, split the geometry of each layer into multiple instances in that case, which shrinks the AABB of each instance and thereby reduces the overlap.
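To make the effect of that split concrete, here is a small CPU-side sketch with made-up data. It compares the single AABB of a layer whose geometry sits in two distant buildings against one AABB per building. `Point`, `Aabb`, `boundsOf` and `contains` are hypothetical helpers for illustration, not OptiX types.

```cpp
#include <algorithm>
#include <vector>

struct Point { float x, y, z; };
struct Aabb  { Point lo, hi; };

// Axis-aligned bounding box of a non-empty point set.
inline Aabb boundsOf(const std::vector<Point>& pts) {
    Aabb box{pts.front(), pts.front()};
    for (const Point& p : pts) {
        box.lo = {std::min(box.lo.x, p.x), std::min(box.lo.y, p.y), std::min(box.lo.z, p.z)};
        box.hi = {std::max(box.hi.x, p.x), std::max(box.hi.y, p.y), std::max(box.hi.z, p.z)};
    }
    return box;
}

// True if the point lies inside or on the boundary of the box.
inline bool contains(const Aabb& b, const Point& p) {
    return b.lo.x <= p.x && p.x <= b.hi.x &&
           b.lo.y <= p.y && p.y <= b.hi.y &&
           b.lo.z <= p.z && p.z <= b.hi.z;
}
```

For wall vertices in two buildings at x = 0..10 and x = 100..110, a point in the empty gap at x = 50 is inside the per-layer box but outside both per-building boxes, so rays passing between the buildings would no longer have to enter that layer's instances.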

If you can handle the whole thing with a single level instancing hierarchy, that’s fine, even if there are many more instances than layers inside the top-level IAS.

Still, using instances for that layering mechanism is the only reasonable choice if you do not want to rebuild your previous single GAS on each layer toggle, which would be another solution.

That was very helpful thanks.

This is also happening when engines sort their geometries by material which isn’t the best idea for a spatial acceleration structure.

That is actually what I am trying to do. My apology if I didn’t articulate it better in the first place.

So it is a rather small scene of around 150k triangles.
image

The fps for this view in the two versions (with and without grouping the geometries by their layer/material) is
24 (without) versus 20 (with grouping), and for a closer view it is 7.5 vs 4.5. So the reduction varies between 17% and 40%.
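For reference, the percentages follow from the usual relative-drop formula; a one-line sketch (the function name `percentReduction` is just for illustration):

```cpp
// Relative frame rate drop in percent when going from `before` to `after` fps.
inline double percentReduction(double before, double after) {
    return 100.0 * (before - after) / before;
}
```

Going from 24 to 20 fps is a drop of roughly 16.7%, and going from 7.5 to 4.5 fps is a drop of 40%.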

What is VSync and how can I enable or disable it? Could you elaborate more on this ? thanks.

I do pass the OPTIX_TRAVERSABLE_GRAPH_FLAG_ALLOW_SINGLE_LEVEL_INSTANCING flag to pipelineCompileOptions.
The application is based on your Optix_App intro_runtime code, which already had the flag.
Or does it need to be passed to the optixTrace method as the man page suggests?

Specifies the set of valid traversable graphs that may be passed to invocation of optixTrace()

Does compaction increase the traversal speed ? I need to try this.

Thanks for your help.

What is VSync and how can I enable or disable it? Could you elaborate more on this ? thanks.

Vertical sync means synchronizing with the monitor refresh rate.
If you’re using OpenGL to display, the default driver behavior on NVIDIA graphics boards is VSync enabled.
You can either disable that globally inside the NVIDIA Control Panel under
3D Settings -> Manage 3D Settings -> Global Settings -> Settings -> Vertical sync -> Off
or inside an OpenGL application by setting the swap interval to zero ( == immediate swap) with this WGL extension
https://www.khronos.org/registry/OpenGL/extensions/EXT/WGL_EXT_swap_control.txt
Similar under Linux with the GLX version of that.

That should always be disabled for benchmarks with display to the screen.
At frame rates of 4.5 fps that hardly makes a difference.

20 fps seems low for a 150k triangles model with all diffuse materials in that view. What resolution and GPU is that?

Or does it need to be passed to optixTrace method as man page suggests?

You misunderstood the text in the API reference.
It means that the traversable argument you pass to optixTrace() must match the type of AS and the traversal depth you specified on the host with the OptixTraversableGraphFlags in the traversableGraphFlags field of OptixPipelineCompileOptions. The flag itself is set on the host only.
Explained here: https://raytracing-docs.nvidia.com/optix7/api/html/group__optix__types.html#gabd8bb7368518a44361e045fe5ad1fd17

Since my examples are all using OPTIX_TRAVERSABLE_GRAPH_FLAG_ALLOW_SINGLE_LEVEL_INSTANCING, the traversable inside optixTrace() is always the scene root IAS traversable handle.

Does compaction increase the traversal speed ? I need to try this.

Most likely, because the AS gets smaller and saves memory accesses and improves caching.
(It’s GPU dependent because RTX boards compact a lot better.)

There are also additional OptixBuildFlags which affect the BVH.
Using OPTIX_BUILD_FLAG_PREFER_FAST_TRACE as well can improve the traversal speed at some increased AS memory cost.

Ok thanks.

  • GeForce GTX 1060 6GB, 1280 cores
  • CUDA 10.0
  • Driver ver: 456.71

I have attached the obj file, and if you have a chance I am interested to see what fps you get, in case I am missing something or doing something completely wrong.
The coordinates in the obj file are world coordinates (so z is height) rather than camera space as in the intro_runtime code.

The center of view for the below image is:
m_pinholeCamera.m_center = (-31.4486605, -6.7559335, 11.74482)
m_pinholeCamera.m_fov = 60
m_pinholeCamera.m_distance= 50

I get fps 6.5 for this view with resolution 1647 x 1000
( all scene triangles are in one GAS and this inside one IAS)

PROPOSED.obj (5.0 MB)

I see, the GTX 1060 is no match for a current Ampere based RTX board.

Loading that OBJ into my rtigo3 example had the scene flipped horizontally, so I let the assimp loader convert the scene assuming left-handed coordinates (OBJ default is right-handed), enabled generation of flat instead of smooth normals, and rotated the model to have z-axis up with the top-level instance transform, all using the same diffuse materials.
(You seem to have at least one more on the building in the right background.)

For comparison with the currently highest-end workstation board available, with a constant environment and two path segments to get this ambient occlusion look, this runs at around 300 fps on an NVIDIA RTX A6000 in my example using OpenGL interop. No surprises here.

Thanks for trying this. That is amazing 300 fps !!

So do you think 6.5 fps is to be expected on a GTX 1060? It would be great if you could run it on a device similar to the 1060.
I am just trying to find out if I am doing something wrong or it is the device limitation.

Well, and that example renderer is not optimized for this case. I think this can be done even faster.

Yes, I think a GTX 1060 is that slow. It’s at the lower end of the Pascal boards and there are already three newer GPU generations shipping since then, two of them with hardware ray tracing support.

I don’t have any comparable setup. I only use recent high-end workstation boards for my daily work.

Not sure if my arithmetic is correct, but if there are 3 rays per pixel (1 primary ray and 2 AO rays), then 300 fps means 1647 * 1000 * 3 * 300 = 1.48B rays/sec. This seems a little slow for an A6000 (unless I mis-counted the number of rays).

I tried this model yesterday on the OptiX 6 SDK sample optixMeshViewer, because it can read OBJ files. The sample uses 2 rays per pixel (1 primary ray and 1 hard shadow ray). I was getting about 1400 fps on Linux using an RTX 8000 with roughly the same view as above, for a 1024 * 768 resolution. So the total is 1024 * 768 * 1400 * 2 = 2.2B rays/sec. I suspect that we could double or maybe even triple the rays/sec number if we took many samples per pixel… the overheads of an extremely high framerate like 1400 fps will be dominant: the kernel launch calls and the framebuffer handling alone really add up with that many frames.

Anyway, @afatourechi you might also try the OptiX 6 optixMeshViewer on your GPU to see how many rays per second you get. But I agree with Detlef, the 1060 wasn’t built for ray tracing like the more recent ones, so you shouldn’t worry too much about its performance. For what it’s worth, I believe newer and cheaper consumer RTX Ampere GPUs will perform as well as or even out-perform my RTX 8000 here. It might be worth considering trying a GPU with RT cores before doing too much optimizing, just because the RT cores change the balance of ray tracing vs compute compared to the 1060.


David.

I suspect that we could double or maybe even triple the rays/sec number

Just to be a little less speculative, I tried modifying the SDK code (in pinhole_camera.cu) and I set the optixMeshViewer samples per pixel to 100 (with some pixel jittering to get antialiasing). Now I get a framerate around 27 fps. So the rays per second is 27 frames/sec * (1024 * 768) pixels * (2 * 100) rays/pixel =~ 4.2B rays/sec. Almost double with a quick hack. To be fair, this sample and my hack are both pretty cache friendly compared to AO rays.
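The three throughput estimates in this part of the thread all use the same product of resolution, rays per pixel and frame rate. A small sketch of that calculation (the helper name `raysPerSecond` is made up, and the ray counts per pixel are the rough estimates quoted in the posts, not measurements):

```cpp
#include <cstdint>

// Rays traced per second for a given resolution, ray count per pixel and
// frame rate. 64-bit arithmetic avoids overflow for results in the billions.
inline std::uint64_t raysPerSecond(std::uint64_t width, std::uint64_t height,
                                   std::uint64_t raysPerPixel, std::uint64_t fps) {
    return width * height * raysPerPixel * fps;
}
```

Plugging in the numbers gives about 1.48B rays/sec for 1647 x 1000 at 300 fps with 3 rays per pixel, about 2.2B for 1024 x 768 at 1400 fps with 2 rays per pixel, and about 4.2B for 1024 x 768 at 27 fps with 200 rays per pixel.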


David.

That’s probably right on average. In my case two path segments means either 1 (primary miss), 3 (primary hit, secondary miss) or 4 (primary hit, secondary hit) rays per pixel.

My OptiX 7 examples are global illumination path tracers with next event estimation and multiple importance sampling which support nested volumes with IOR and absorption and use direct callable programs for the lens sampling, light sampling and BXDF sampling and evaluation, including anisotropic distributions requiring tangent attributes.

Yes, rendering my image above should run a lot faster at the same quality when removing the unused features and special casing the closest hit program and light sampling.