BVH questions

Hi!

I have two questions regarding the BVH acceleration structures:

  1. How can I determine how many objects are stored per leaf in the BVH? Is this a dynamic value and can it be retrieved from the BVH structures? In one of the example programs on github, I found the use of the undocumented Bvh8 (optix_advanced_samples/optixParticleVolumes.cpp at master · nvpro-samples/optix_advanced_samples · GitHub), which I assume has 8 objects per leaf. Is that correct?

  2. When intersecting a ray with the BVH using rtTrace and both, an any hit and closest hit program are registered with the geometry, can I assume that the BVH structure is traversed starting with the closest leaf to the ray origin?

Best,
Christoph

Hi Christoph,

1- You can’t query the acceleration structure for maximum number of objects per leaf, and even if you could it’s subject to change between driver and hardware releases.

Yes “Bvh8” is an (up to) 8-wide node structure that is implemented in multiple code bases in various similar ways. You’ll be able to find papers and example code online that show how it works in general, though we may have NVIDIA-specific implementation details.

Please note that with RTX hardware, the BVH type string in that sample you linked to is ignored, and a hardware specific BVH format is used instead. If you have an RTX GPU, you won’t be getting Bvh8 even if you request it.

Also note that we don’t list Bvh8 because it’s experimental and not our default recommendation for traversal performance.

2- You cannot assume the order of traversal with rtTrace(). The OptiX documentation is explicit about this, for example https://raytracing-docs.nvidia.com/optix_6_0/guide_6_0/index.html#programs#attrvars. If you use AnyHit shaders, you will find they’re often called out of depth order. That is actually the main function of the sample you linked to – optixParticleVolumes is collecting a set of intersection points and then sorting them in depth order in order to do the correct depth blending.


David.

Hi David,

thanks for your quick response! As to

  1. Hmmm, that’s unfortunate. It’d be great if I could query the number of objects per leaf and then adapt my code to account for it, depending on driver and hardware. The upper bound of objects per leaf is exactly what I need to speed up my code (so Bvh8 would be exactly fitting). I indeed have an RTX GPU but do not use the RTX cores for accelerated geometry calculations, do I still not get the Bvh8? Is there any other implementation that I could query with a fixed upper bound for objects at the leaf?

  2. I’m well-aware that I can’t influence the order of intersected objects within on leaf of the Bvh. However, if I’d know that 8 is the upper bound of objects per leaf and I’m using (a multiple of) this value for a sorting buffer, then I’m guaranteed to be able to find the correct object within the Bvh leaf. I think that’s actually a trick that’s also done in the optixParticleVolumes example with a particle buffer size of 32 and a Bvh max leaf size of 8. What I’m still interested in and am not sure whether you answered in your previous question is whether the first leaf that is traversed over is the one with the lowest t value for the ray (which would make sense if a closest hit buffer is registered).

Thanks a lot! :)
Christoph

You still get RTX hardware traversal even when you don’t use the hardware triangle intersection, so yes you are using the RT cores and the BVH format is still hardware specific.

There are no guarantees on either upper bound children per leaf, nor traversal order of leaf nodes. Note that traversal order of leaf nodes is always ambiguous, even if they were traversed in a known order. With a given ray, you can always arrange the primitives in two leaf nodes such that one leaf is traversed before the closest intersection point is found in the other leaf.

The buffer size of 32 in optixParticleVolumes is a only convenient number, it is not related in any way to the number of children per leaf node. It’s just a good power of two for doing the parallel bitonic sort, and it’s large enough to usually/mostly capture enough transparent particles that you don’t see any artifacts.

I’m curious what decisions you can make to speed things up if you could count the children in a leaf node? If your use case is general, we will certainly consider making adjustments to support what you need.

If you want to guarantee you have exactly 8, or no more than 8 children in a leaf node, here is what I would recommend trying: do a single pass clustering algorithm of your own and write a custom intersection program that handles 8 children at a time. This way you have control over what you consider a leaf node and what’s inside of it, and you can still use the RT cores in a way that is agnostic to implementation.


David.

Hi David,

“You still get RTX hardware traversal even when you don’t use the hardware triangle intersection, so yes you are using the RT cores and the BVH format is still hardware specific”: that’s great to know!

“The buffer size of 32 in optixParticleVolumes is a only convenient number, it is not related in any way to the number of children per leaf node. It’s just a good power of two for doing the parallel bitonic sort, and it’s large enough to usually/mostly capture enough transparent particles that you don’t see any artifacts”. Ok, understood. However, if more than 32 objects are in one ‘slab’, that means that in this example all other objects are lost, right? The ray is terminated in that slab and a new ray is used to traverse the next one. Or did I misunderstand? In that case, it’s important to just keep the maximum number of objects per slab known and <32.

“I’m curious what decisions you can make to speed things up if you could count the children in a leaf node? If your use case is general, we will certainly consider making adjustments to support what you need”. Thanks a lot for the consideration! Probably the BVH in OptiX is overkill for my use case and I’ll retreat to a CUDA based implementation.

As for your last suggestion: do you mean collecting all collisions with an any hit program and the BVH, then cluster, then use the custom intersection program? Or do you mean clustering and intersecting fully custom? In that case I probably couldn’t make use of the RTX acceleration since there are no CUDA commands to use the RTX infrastructure, right?

Best,
Christoph

if more than 32 objects are in one ‘slab’, that means that in this example all other objects are lost, right?

Yes, this particle rendering example is only approximate, and it assumes that by the time you composite 32 particles, you probably won’t see the results of any more. It’s usually true, but there are some small visible artifacts if you look closely.

This sample is assuming that leaf node traversal is somewhat approximately in depth order, and that there’s high probability of the several closest hit points being in the first 32 hit points collected. That’s not guaranteed the way this sample is written, but it is still fairly likely.

do you mean collecting all collisions with an any hit program and the BVH, then cluster, then use the custom intersection program? Or do you mean clustering and intersecting fully custom? In that case I probably couldn’t make use of the RTX acceleration since there are no CUDA commands to use the RTX infrastructure, right?

I was suggesting the latter: cluster into groups of 8 in advance of building your BVH, and use your own fully custom intersector. If you use OptiX, you will still get automatic RTX acceleration in the form of traversal to your leaf nodes, then your intersection program would take over and iterate through the leaf node children, you wouldn’t need any special CUDA commands.


David.