BVH building algorithm and primitive order

Do we get to know what sort of algorithm OptiX uses to build the BVH? I am asking because I found that the order in which I insert custom primitives (through OptixAabb) when building the GAS greatly affects the traversal speed.

No, that is completely internal, because it can change at any time and differs between GPU architectures and even between scene contents (e.g. motion blur, triangle primitives, curve primitives).

The BVH traversal is “view” dependent. That means you could have different performance depending on how your rays traverse through the BVH.

Then you can of course build scenes in a way that is simply inefficient, for example when the AABBs over primitives contain each other.
For example, let’s say you have an architectural scene with walls, windows, doors and roof materials on the different houses. Then building acceleration structures partitioned by these four different materials would generate huge AABBs over a whole town which also overlap badly. That BVH will be inefficient.
That means it's sometimes better to partition your primitives spatially to get mostly disjoint GAS AABBs (e.g. per house in this example) than to sort a huge scene by material.

Thanks for the response. I am not changing the rays or the primitives in the scene. I have exactly the same primitives; only their order differs when I cudaMemcpy them to the device. I would expect the BVH to be exactly the same even if the primitive order in memory is different, but what I found is that some orderings result in much more efficient traversal than others.

To be more specific, I have a bunch of spheres of the same radius, and I construct an AABB for each sphere. I then create an OptixAabb array and pass it to the GAS build. What I find is that the order of the AABBs in that array affects the traversal performance later. The actual traversal uses the exact same rays in the same order.
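For reference, the way I build the per-sphere AABBs looks roughly like this. It's only a sketch: `centers` (a host array of float3), `radius`, and the buffer names are placeholders for my actual data.

```cpp
#include <optix.h>
#include <cuda_runtime.h>
#include <vector>

// One OptixAabb per sphere; all spheres share the same radius.
std::vector<OptixAabb> aabbs(centers.size());
for (size_t i = 0; i < centers.size(); ++i) {
    aabbs[i].minX = centers[i].x - radius;
    aabbs[i].minY = centers[i].y - radius;
    aabbs[i].minZ = centers[i].z - radius;
    aabbs[i].maxX = centers[i].x + radius;
    aabbs[i].maxY = centers[i].y + radius;
    aabbs[i].maxZ = centers[i].z + radius;
}

// Copy the array to the device; this is the copy whose ordering I am varying.
CUdeviceptr d_aabbs = 0;
cudaMalloc(reinterpret_cast<void**>(&d_aabbs), aabbs.size() * sizeof(OptixAabb));
cudaMemcpy(reinterpret_cast<void*>(d_aabbs), aabbs.data(),
           aabbs.size() * sizeof(OptixAabb), cudaMemcpyHostToDevice);

// d_aabbs is then referenced from an OptixBuildInput of type
// OPTIX_BUILD_INPUT_TYPE_CUSTOM_PRIMITIVES for the GAS build.
```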

Please quantify “much more”. Benchmark results require absolute values.
Would you be able to provide a minimal and complete reproducer to be able to investigate what happens?

Hmm, I think I found the bug in my code, but just to be sure: as long as the primitives (AABBs) are the same, regardless of their order in memory, the resulting BVH should be the same, right?

I think it’s roughly the same, and you shouldn’t expect any large performance differences, but be aware that OptiX does not guarantee that the same AABBs will produce the same BVH every time - even if they’re in the same order. In practice your BVH will probably not be bit identical from run to run given the same input data.


David.

Oh this is quite interesting. Would you mind elaborating on this a bit?

I don’t really have more I can add beyond what’s been said; a bit-identical BVH is just not something OptiX specifies or guarantees at this time. I just wanted to be clear that, as with making any assumption about the shape of warps, relying on behaviors we don’t guarantee can get you into trouble, even if they appear to work. I don’t even know all the reasons that can lead to small differences, but if you need bit-identical results for some reason, we’d be happy to hear about your use case and discuss whether we can support it.


David.

Thanks.

Circling back to this question: is BVH construction accelerated on the RT cores or done on the conventional CUDA cores? I am curious because tree construction seems blazingly fast. I guess there has been a lot of research on the trade-off between spending time building an optimal tree that makes traversal faster and building the tree as fast as possible, so for exploring that trade-off maybe it makes sense for tree construction to be done on the CUDA cores?

Thank you, yes, we strive to make OptiX blazingly fast. You are right, there has been a lot of research behind how to build BVHs quickly, and behind the tradeoffs of different representations and build options. In all cases when we build hardware, there are multiple levels of software simulation and research to validate and explore the tradeoffs. That research never really dictates which hardware units are built or used once a feature goes into production.

From the user perspective, there are some tradeoffs that affect build performance. You can give OptiX a hint that you prefer a faster BVH build and allow for a possible compromise in traversal performance using OPTIX_BUILD_FLAG_PREFER_FAST_BUILD. Or you can lean the other direction and ask for faster traversal performance, even if the BVH build takes a bit longer, with OPTIX_BUILD_FLAG_PREFER_FAST_TRACE.
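For illustration, a minimal sketch of where these flags go (only the build options are shown; all other setup is omitted):

```cpp
OptixAccelBuildOptions accel_options = {};

// Ask for the best traversal performance, accepting a longer build.
// Swap in OPTIX_BUILD_FLAG_PREFER_FAST_BUILD to lean the other way.
accel_options.buildFlags = OPTIX_BUILD_FLAG_PREFER_FAST_TRACE
                         | OPTIX_BUILD_FLAG_ALLOW_COMPACTION;
accel_options.operation  = OPTIX_BUILD_OPERATION_BUILD;

// accel_options is then passed to both optixAccelComputeMemoryUsage
// and optixAccelBuild.
```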

If you are building BVHs in interactive scenarios, then it’s important to know that OPTIX_BUILD_OPERATION_UPDATE can be as much as 10x faster than OPTIX_BUILD_OPERATION_BUILD, but the UPDATE operation can also degrade your traversal performance, so it’s important to be careful with it and understand when you might need to use BUILD anyway.
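As a sketch of the refit path, assuming the GAS was originally built with OPTIX_BUILD_FLAG_ALLOW_UPDATE and that only primitive positions change, not the primitive count or topology:

```cpp
// The original build must request updatability up front.
OptixAccelBuildOptions options = {};
options.buildFlags = OPTIX_BUILD_FLAG_PREFER_FAST_TRACE
                   | OPTIX_BUILD_FLAG_ALLOW_UPDATE;
options.operation  = OPTIX_BUILD_OPERATION_BUILD;
// ... optixAccelBuild( context, stream, &options, ... );

// Later, after the primitives move:
options.operation = OPTIX_BUILD_OPERATION_UPDATE;
// ... optixAccelBuild( context, stream, &options, ... ) refits the existing
// BVH. This can be much faster than a full BUILD, but repeated refits can
// degrade tree quality and therefore trace performance.
```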


David.

Thank you for the clarification. I use the buildGas function provided in some SDK examples as the starting point for building the GAS. I have about 10M AABBs. What I found is that the three OptiX calls, optixAccelComputeMemoryUsage, optixAccelBuild, and optixAccelCompact, are usually very fast, each taking less than 1 ms, but the entire buildGas takes about 50 ms, of which ~47 ms is spent copying back the 8-byte compacted_gas_size variable. Any idea how to optimize this away?

I have 10 such BVHs to build offline. Since they are independent, I put them in different streams and call buildGas 10 times back to back. Now, since the memcpy dominates and the copies from the different streams all go in the same D2H direction, I think they have to be serialized, which degrades performance quite a bit.
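For context, the part of buildGas I am timing looks roughly like this (simplified from the SDK helper; error checking, buffer allocations, and the build input setup are omitted, and the names are abbreviated):

```cpp
// Ask the build to write the compacted size into an 8-byte device buffer.
OptixAccelEmitDesc emit_desc = {};
emit_desc.type   = OPTIX_PROPERTY_TYPE_COMPACTED_SIZE;
emit_desc.result = d_compacted_size;

optixAccelBuild(context, stream, &accel_options, &build_input, 1,
                d_temp,   sizes.tempSizeInBytes,
                d_output, sizes.outputSizeInBytes,
                &gas_handle, &emit_desc, 1);

// This is the 8-byte D2H copy that appeared to dominate my timings.
size_t compacted_gas_size = 0;
cudaMemcpy(&compacted_gas_size, reinterpret_cast<void*>(d_compacted_size),
           sizeof(size_t), cudaMemcpyDeviceToHost);

if (compacted_gas_size < sizes.outputSizeInBytes) {
    CUdeviceptr d_compacted = 0;
    cudaMalloc(reinterpret_cast<void**>(&d_compacted), compacted_gas_size);
    optixAccelCompact(context, stream, gas_handle,
                      d_compacted, compacted_gas_size, &gas_handle);
}
```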

Here are some further measurement results:

On ~50M AABBs:
optixAccelComputeMemoryUsage: 0.135696 ms
optixAccelBuild: 0.374136 ms
copy: 190.367 ms
optixAccelCompact: 0.074713 ms
total time buildGas: 220.286 ms

on ~10M AABBs:
optixAccelComputeMemoryUsage: 0.125716 ms
optixAccelBuild: 0.640512 ms
copy: 41.3283 ms
optixAccelCompact: 0.042754 ms
total time buildGas: 48.0881 ms

on ~1M AABBs:
optixAccelComputeMemoryUsage: 0.115211 ms
optixAccelBuild: 0.296703 ms
copy: 4.57826 ms
optixAccelCompact: 0.024176 ms
total time buildGas: 6.59666 ms

I guess I am quite curious why the cudaMemcpy time, for the same 8 bytes of data, would vary with the number of AABBs.

Also, from NVVP I can see a whole bunch of NVIDIA-internal activities going on while building the GAS, which I suppose aren’t captured by the three OptiX calls but contribute to the total time. Any idea what those might be?

OK, I am being stupid here. I measured optixAccelBuild without proper synchronization. With sync, the build time dominates.

This is a good question!

First, it sounds like you might need to double-check your timing methodology. It’s very easy to get misleading timings from CUDA kernels, and that’s what I think is happening here. For one thing, you cannot measure the time of an optix* function call on the host (or a CUDA kernel launch either). All these host functions are asynchronous: they queue up the work for the kernel to eventually launch and then return immediately, potentially before the kernel has actually launched, and often long before the kernel has finished.

Most likely, your optixAccelBuild is taking the longest, and the cudaMemcpy is quite fast. The reason the cudaMemcpy seems to take longer is that it waits for the other operations to complete first; the dependency causes serialization.

One of the best ways to time your kernels is to record CUDA events into the stream immediately before and after each kernel. You can also do this using stream callbacks (see cudaLaunchHostFunc in the CUDA Runtime API documentation).

A less precise but still reasonable thing to do if you’re in a hurry is to sprinkle cudaDeviceSynchronize() on either side of the kernel you’re trying to time. So: sync, start your timer, launch your kernel, sync again, then stop your timer. This ensures that your timing only measures the kernel you care about, and that your timer waits until the kernel is finished, accounting for all the time.
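As a sketch of the event approach (assuming the build and the events are issued on the same stream, and that context, accel_options, build_input, and the buffers are already set up as in your buildGas):

```cpp
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);          // marks the point just before the build
optixAccelBuild(context, stream, &accel_options, &build_input, 1,
                d_temp,   sizes.tempSizeInBytes,
                d_output, sizes.outputSizeInBytes,
                &gas_handle, &emit_desc, 1);
cudaEventRecord(stop, stream);           // marks the point just after the build

cudaEventSynchronize(stop);              // wait until the build has actually finished
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // GPU time elapsed between the two events
printf("optixAccelBuild: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```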

So… with that in mind… your measurements are showing optixAccelBuild as much faster than it really is. It is fast, but not quite that fast. ;)


David.

Now, as to how to optimize these things. I’ve heard from the BVH team that their experiments show 2-4 streams to be effective at increasing speeds, and that more streams than that usually don’t seem to help much. The benefit you’ll see from multi-stream BVH builds will be inversely proportional to the BVH size. If you are building very large BVHs, then the build is going to do a decent job of saturating the GPU on its own. If you are building many small BVHs, that’s when using a few streams is going to help the most.

If you still see the cudaMemcpy having any kind of large effect after your timings are synchronized, then one option is to collect your compaction results into an array and do a single copy instead of copying the result from every BVH build separately. This still serializes, but you can eliminate some of the overheads of the device communication by grouping all the BVH sizes together. Normally, the copy should be fairly small and negligible. This strategy is also one I would recommend considering mostly for the case of building many small BVHs, where you would combine it with the multiple streams and also with batched memory allocations.
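One possible shape of that batching, purely as a sketch (the buffers, the stream array, and the per-build inputs are all placeholders):

```cpp
// One device buffer holds the compacted size of every GAS.
CUdeviceptr d_compacted_sizes = 0;
cudaMalloc(reinterpret_cast<void**>(&d_compacted_sizes), num_gas * sizeof(size_t));

for (int i = 0; i < num_gas; ++i) {
    OptixAccelEmitDesc emit_desc = {};
    emit_desc.type   = OPTIX_PROPERTY_TYPE_COMPACTED_SIZE;
    emit_desc.result = d_compacted_sizes + i * sizeof(size_t);  // one slot per build

    optixAccelBuild(context, streams[i % num_streams], &accel_options,
                    &build_inputs[i], 1,
                    d_temps[i],   temp_sizes[i],
                    d_outputs[i], output_sizes[i],
                    &gas_handles[i], &emit_desc, 1);
}

// A single D2H copy retrieves all the sizes at once instead of one
// 8-byte copy per build.
std::vector<size_t> compacted_sizes(num_gas);
cudaDeviceSynchronize();
cudaMemcpy(compacted_sizes.data(), reinterpret_cast<void*>(d_compacted_sizes),
           num_gas * sizeof(size_t), cudaMemcpyDeviceToHost);
```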


David.

Thanks for the detailed and patient response. As mentioned in an earlier post, I did make the mistake of not properly synchronizing for time measurement!

I experimented with building 5 BVHs back to back. Each has about 1M AABBs. From NVVP, they are completely serialized. Could this be because optixAccelComputeMemoryUsage is synchronous (as it doesn’t take a stream argument) and is therefore blocking the overlap across the different builds?

It sounds like you’re set up for parallel streams correctly, and sometimes seeing the kernels overlap, right?

Most likely this is just a case of each BVH being large enough to saturate the GPU. Even when kernels are issued to run in parallel, they may execute mostly serially for best performance. It’s better for a large kernel to hog the GPU resources until it no longer has enough threads to occupy the entire device, and then another kernel may start to execute. So, I’m guessing if you’re setup correctly and if you profile carefully, you’ll see there is some overlap, just not very much - and this is okay and expected.

optixAccelComputeMemoryUsage() is a host-side function; it doesn’t run on the GPU at all, so it does not affect GPU serialization. However, all memory allocation and de-allocation can cause serialization across the entire device, even when multiple streams are active. So it is best to avoid allocating and de-allocating in the middle of a lot of compute work. Performance-wise, you’re better off calling optixAccelComputeMemoryUsage 10 times in a row before you start, gathering all the sizes, then allocating a single block large enough for all 10 BVHs, and running the optixAccelBuild calls in parallel with each one placing its result at the correct spot in your large buffer. Then you can repeat for compaction: gather all the sizes at the same time, then allocate a single buffer for all the results, then issue all the compaction kernels at the same time, then deallocate your initial buffer. This way you avoid the extra copies and the memory management during your build and compaction work.

The above is just an example of how to make the parallel BVH builds fast. It might not be the best approach for managing memory, since if compaction is needed, it’s often the case that allocating all BVH memory in a single block is not possible. You can do something in-between, especially if you have a lot of BVHs. If you were building 1000 BVHs, you could group together 10 of them at a time and get better performance than building one at a time.
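A rough outline of that grouping, again only as a sketch: roundUp is a hypothetical helper that pads each slice to OptiX’s 128-byte accel buffer alignment, and the input arrays (build_inputs, streams, emit_descs, gas_handles) are placeholders.

```cpp
// Phase 1: query all sizes up front (host-side calls, no GPU work yet).
std::vector<OptixAccelBufferSizes> sizes(num_gas);
size_t total_temp = 0, total_output = 0;
for (int i = 0; i < num_gas; ++i) {
    optixAccelComputeMemoryUsage(context, &accel_options, &build_inputs[i], 1, &sizes[i]);
    total_temp   += roundUp(sizes[i].tempSizeInBytes,   128);
    total_output += roundUp(sizes[i].outputSizeInBytes, 128);
}

// Phase 2: one allocation for all temp buffers and one for all outputs.
CUdeviceptr d_temp = 0, d_output = 0;
cudaMalloc(reinterpret_cast<void**>(&d_temp),   total_temp);
cudaMalloc(reinterpret_cast<void**>(&d_output), total_output);

// Phase 3: issue all builds, each into its own slice, round-robin over streams.
size_t temp_off = 0, output_off = 0;
for (int i = 0; i < num_gas; ++i) {
    optixAccelBuild(context, streams[i % num_streams], &accel_options,
                    &build_inputs[i], 1,
                    d_temp   + temp_off,   sizes[i].tempSizeInBytes,
                    d_output + output_off, sizes[i].outputSizeInBytes,
                    &gas_handles[i], &emit_descs[i], 1);
    temp_off   += roundUp(sizes[i].tempSizeInBytes,   128);
    output_off += roundUp(sizes[i].outputSizeInBytes, 128);
}

// Compaction can follow the same pattern: gather all compacted sizes with a
// single copy, allocate one compacted buffer, then issue all optixAccelCompact
// calls before freeing d_output.
```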


David.