Ray traversal slowdown with a distant object

I have a scene with 1000 meshes placed next to each other (each mesh is about 1k triangles and takes up a 1x1x1 unit volume). The ray traversal is very fast for this scene until I add a single mesh far away. When I place this single mesh far away (say 1e11 units), I see a slowdown of about 3-5x (tested on two separate GPUs). The further away the object, the slower the traversal. The scene consists of a single IAS with separate GAS for each mesh. If I merge the 1000 meshes into a single GAS, the traversal time is effectively unchanged when I add the distant mesh.

  1. With this type of scene, would it be expected that a distant object would degrade the acceleration structure and cause this slowdown?
  2. If this is a degradation of the IAS, are there any build flags that can improve the traversal of this type of scene?
  3. If I cannot improve this situation via build flags, is there a recommended way of structuring the scene that would circumvent this issue?

Any help or insights would be much appreciated. Thank you!

The ray traversal is very fast for this scene until I add a single mesh far away. When I place this single mesh far away (say 1e11 units)

For clarification, do you mean the OptixInstance transform is translating by that 1e11 or are the vertex positions inside the GAS that far away?

1e11 is 100,000,000,000, which is far beyond the precision of 32-bit floats (23-bit mantissa). Values larger than about 8 million already start to show precision issues.

It’s usually recommended to model vertex positions around the origin and translate them into the world with instance transforms to regain some precision after the ray is transformed into the local object coordinates space during intersection.

If you say this works fast with just one GAS and all vertices in world positions but is slower when modeling objects around the origin and translating them with instance transform, that is surprising. In either case, the top-level AABB should be the same.

Just to make sure, you’re benchmarking this with fully optimized device code and all OptiX module and pipeline compile options set to full optimizations, disabled exceptions, and disabled OptiX validation mode?

To be able to say what happens exactly, we’d need some minimal and complete reproducer project.

tested on two separate GPUs

Please always provide the exact system configurations in case this is GPU or driver dependent:
OS version, installed GPU(s), VRAM amount, display driver version, OptiX major.minor.micro version, CUDA toolkit version used to generate the module input, host compiler version.

@droettger, thank you for the quick reply.

For each instance, the positions inside the GAS are centered around the origin. The distant object is translated to that location via the OptixInstance transform. And to be clear, when I merge the 1000 meshes into a single GAS, I end up with a scene with 2 instances: 1 for the 1000 meshes that are close together and 1 for the distant object still translated using the OptixInstance transform.

Yes, this was benchmarked with all optimizations enabled and validation mode disabled.

I will try to recreate this with an example from the SDK to provide a minimal working example, since this behavior seems unexpected.

System configuration tested:
Machine 1:
OptiX 7.7.0
CUDA: 12.1
GPU: RTX A4000 Mobile with 8 GB memory
Driver: 545.29.06
OS: Fedora 38

Machine 2:
OptiX 7.7.0
CUDA: 12.1
GPU: RTX 4090 with 24 GB memory
Driver: 535.113.01
OS: Fedora 38

OK, thanks for the clarification. That is a different setup than I thought.
So it’s basically the same scene, once with 1001 instances and once with 2 instances, but always including the far-away instance. The slowdown only affects the 1001-instance case, so it’s probably not the far-away instance alone that takes longer to traverse.

That it’s happening on two different GPU architectures and display driver versions is also interesting.

Yeah, we should definitely take a look at that.

It would be interesting to analyze where the time goes. That can be done by reading the GPU clock() at the beginning and end of the ray generation program, storing a scaled float value with the elapsed time, and visualizing that as a heatmap.
I’m doing that in the USE_TIME_VIEW code path in some examples.
Note that this per thread time measuring method is not going to work with Shader Execution Reordering (SER).

I was able to recreate this issue in the SDK, which will hopefully make it easier to see and reproduce. For the most part I was able to recreate the scenario in a glb file loaded into optixMeshViewer. I slightly modified optixMeshViewer to position the camera and run a benchmark (although you can clearly see the time difference in the scene statistics as well).

Modified optixMeshViewer code along with glb files that define each scene configuration are here: IAS_slowdown.zip (877.2 KB). Interestingly, the problem only seems to happen when the meshes are randomly distributed, not when they are arranged in a grid. Additionally, I reproduced it with simple cube meshes, so it seems to have nothing to do with mesh complexity. I’ve included both the random and grid scenes for reproducibility.


  • random_scene_1024instances.glb - 1024 meshes all near the origin
  • random_scene_1025instances.glb - same as 1024 but with one additional instance with translation 1e11,1e11,1e11
  • random_scene_2instances.glb - the 1024 local meshes all merged into one GAS, the second instance is the distant object translated with the OptixInstance transform

Thank you for pointing this out - it is very insightful! I encoded the time in the image for this scene in optixMeshViewer and have included screenshots of the encoded time in the zip file for reference. I’m not sure I have the perspective to draw many conclusions from this heat map, but it seems that traversal for many of the rays that do not intersect any geometry bounding boxes is significantly degraded.

I might assume the issue is one of accidentally capturing the distant child in an intermediate node, so that node’s AABB balloons to span both the near cluster and the distant instance.

By merging the near-origin meshes first, you effectively partition them from the distant part, giving the tree builder a big hint about the correct spatial layout and preventing the problem from happening. With the time-view you might need to zoom out far enough to capture the boundaries of the problematic nodes; otherwise, yes, you’ll just see the whole screen get brighter without being able to see why.

You could check if using the FAST_BUILD flag does a better job of avoiding such accidentally ballooned nodes in the tree.

Is merging the near-origin meshes before building the IAS not a workable or general solution for you?

And do you need to capture such very distant meshes? Note that for a unit volume mesh at a distance of 1e11, the distant child bounds in the IAS will be much, much larger than the mesh itself. A single bit change of the least significant bit of 1e11 is 8192, meaning your tree node’s bounding box must be at least that big, if not bigger. That alone will cause some traversal slowdown whenever this distant object is in view or has rays that approach it. And the hit results will have extremely large error compared to the size of the mesh. So it might be ideal, if possible, to detect and discard small, extremely distant objects from the scene.


Thank you for the explanation of the potential tree problem. Is there any way to confirm this is what’s happening? I know I can view the BVH in nsight-compute, but as far as I’ve seen that only shows the aabb for the geometries without any insight into the tree structure.

I tried this flag but had similar results.

The scene is dynamic, so merging the meshes into a single GAS is an option for static parts of the scene, but not good as a general solution as I would need to redo this merge when transforms change.

I have been thinking through some potential workarounds, but ideally I could keep the scene structure general. Filtering out small, distant objects makes sense, since you likely won’t see them in a render anyway. But in my case, my far-away objects tend to be massive, and their size alleviates some of the precision issues you mentioned. Would a multi-level IAS get around the BVH build problem you mentioned without too much additional overhead?

I’ve asked around the team and learned that we’ve considered doing a specific automatic extreme distance outlier detection, but it’s still being discussed in terms of whether to do it, and how to do it without sacrificing build perf for more common uses.

This means I believe we don’t currently have any other settings that might help, unfortunately. I also learned that the extreme distance of an outlier can degrade the quality of BVH construction, which can result in increased over-traversal, in addition to the possibility of ballooned intermediate nodes.

The multi-level IAS could help in the sense that it means you’d conceptually be doing the same partitioning you already tested that resolves this issue. If you have all your near-origin meshes underneath one instance, and the very far meshes underneath other instances, then I do expect that tradeoff to pay off. There is a small traversal cost to adding another level in the BVH, but I would expect it to be much smaller than the slowdown you’re currently experiencing. The main thing using multi-level instancing would do is help you avoid having to merge or flatten your near-origin meshes into a single GAS. Is it reasonable to explore that path?


Thank you for the explanation and for looking into this. I will test and look into implementing the multi-level IAS, as that seems like a decent option that will let me keep transforms for dynamic objects and retain the option of geometry instancing.
