Proper way of batching multiple GAS builds

Hello,

I’m learning the proper way of batching multiple GAS builds.

Best Practices: Using NVIDIA RTX Ray Tracing (Updated)

All BLAS build calls need unique scratch memory to allow execution without barriers.

  • Is this applicable to OptiX as well?
  • Are there any restrictions on which CUstream to use?
  • Is it disallowed to insert other commands, like CUevent records, between optixAccelBuild calls?

Thanks

Hi @shocker.0x15,

That advice does apply to OptiX accel builds, yes. It’s just another way of saying that the entire scratch memory buffer passed to optixAccelBuild may be used by the build, so obviously you can’t overlap multiple builds that point to the same scratch buffer; otherwise one will corrupt the other. If you are memory constrained, then you certainly can serialize your BLAS builds and re-use the scratch buffer by adding a synchronization barrier of some kind between the builds.

I don’t think there are any restrictions to stream use that need mentioning. You can launch different accel builds on separate streams, as long as they all have separate scratch buffers. One thing I should mention is that OptiX does have some internal batching optimizations for times when multiple AS builds are queued up on the same stream, and so using multiple streams may or may not help improve performance beyond what we’re already doing. This is one reason to avoid using barriers between builds on a given stream - that will prevent OptiX from being able to schedule overlapping work across builds.

You are free to insert events between accel builds - and we recommend doing that for performance measurements.


David.

Is this really applicable?

I made a small program to validate batching multiple BLAS builds.
repro_batched_blas.zip
Use CMake to generate a project file.
usage: <path_to_obj_file> [--single-gas] [--reuse-scratch]

I used the Crytek Sponza OBJ model from the McGuire Computer Graphics Archive to compare the performance of a batched build vs. unbatched builds (reusing the scratch memory). To mitigate the unstable nature of performance measurements in a Windows environment, the program measures the build time 100 times (by default) and prints min, median, and max.

The results in my environment (in ms) are:
Batched: min=74.122 median=77.386 max=84.102
Unbatched: min=74.081 median=77.690 max=80.972
Single Big GAS (just as reference): min=0.825 median=0.865 max=1.271

So, as you can see, there seems to be no difference between the batched and unbatched builds, and the batched build looks far slower than the single big GAS build (a 93x difference!), although some overhead for multiple GAS builds is to be expected.

I suspect one of the following:

  • I have some mistake in my code
  • The OptiX runtime has an issue with batching (because of the driver and/or hardware).

My environment:
OptiX 9.1.0
CUDA 13.1
Driver 591.74
Geforce RTX 4080
Visual Studio Community 2022 17.14.25
CMake 4.0.3
Windows 11 Pro 25H2 26200.7628

Hi @shocker.0x15,

I took a quick look at the sample, and I have a question: what do you mean by ‘batching’ exactly?

It looks like your sample only uses 1 stream, is that correct? If so, all the work you queue on that stream will serialize. This means you’re not really batching BVH builds, as far as I can tell. You also have a cudaEventSynchronize() call in between BVH builds, inside your call to gasBuildTimer.report(). This is an execution barrier, which is what the initial quote you posted is referring to.

In order to batch multiple BVH builds, at least how I would define batching, you need to use multiple streams, and queue up different GAS builds on different streams.

Note that reusing your scratch buffer for parallel work will result in undefined behavior! It is currently working because you are synchronizing your BVH builds. Once you remove the syncs and use multiple streams, then reusing your scratch buffer might work accidentally, or it could result in broken BVH builds, or application crashes. It is simply not allowed to use a single scratch buffer for multiple parallel BVH builds. This is true for any allocation of any buffer with any type of parallel execution, on the GPU or the CPU. This is what the initial quote in your first message is saying. You can use a single allocation that is sized to be the sum of all scratch sizes needed, and provide separate offsets into this scratch buffer for each BVH build, so that they all write to separate partitions of your scratch buffer. But you must not allow your BVH build scratch memory to overlap during parallel builds, even if it initially appears to work.

Additionally, there can be fairly high overhead for very small GAS builds, and the Sponza model has many very small GASes when you build one GAS per group. There are multiple groups with only 2 triangles, and many groups with less than 100 triangles. From a performance perspective, it is a terrible idea to build such small GASes individually. This is why you are seeing the 2 orders of magnitude difference between individual GAS builds and the whole-model GAS build.

To check the performance of GAS batching, you would need to compare building all the OBJ groups on 1 stream to building all the OBJ groups using multiple streams. Comparing to building a single GAS is kind of a different thing, of course, and that is comparing apples and oranges. I think you should expect building 1 large GAS to be faster than building many small GASes. Once you are using multiple streams and letting the many small GAS builds run in parallel, then you can compare that to using just 1 stream. Note that OptiX also does some batching behind the scenes when queueing up multiple GAS builds on a single stream, so there is a difference between using 1 stream with syncs between GAS builds, using 1 stream with no syncs between GAS builds, using multiple streams with syncs between GAS builds, and using multiple streams with no syncs between GAS builds. We currently expect there to be diminishing returns to using more than 4-8 streams for GAS builds.


David.

Hello @dhart!

It looks like your sample only uses 1 stream, is that correct?

Correct.
I was assuming the following:

  • A BVH build has many internal shaders and syncs, like
    Shader A - Sync A - Shader B - Sync B - Shader C - Sync C - ...
  • If we have multiple BVH builds on the same stream, the command sequence will be:
    Shader 0A - Sync 0A - Shader 0B - Sync 0B - Shader 0C - Sync 0C - ... - Shader 1A - Sync 1A - Shader 1B - Sync 1B - Shader 1C - Sync 1C - ...
    but this should result in poor performance when each BVH is too small to fill the entire GPU.
  • So I was assuming some optimization on the OptiX runtime side that creates a command sequence like the following instead:
    Shader 0A - Shader 1A - ... - Sync A - Shader 0B - Shader 1B - ... - Sync B - Shader 0C - Shader 1C - ... - Sync C - ...
    where synchronization points are shared across BVHs.

It is currently working because you are synchronizing your BVH builds.

No, I don’t synchronize BVH builds, I just have synchronizations between measurements.

No, I don’t synchronize BVH builds, I just have synchronizations between measurements.

Ah, yes, sorry! I missed that detail the first time through.

So there are 2 things you can do to get better performance with many small GAS builds:
(1) use multiple streams
(2) use OPTIX_BUILD_FLAG_PREFER_FAST_BUILD. OptiX will provide some improved performance, and possibly improved batching behavior, when using the fast build flag.
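
The flag change is just a different value in OptixAccelBuildOptions; in the sample it would look roughly like this (a fragment of the build setup, not complete code):

```cpp
OptixAccelBuildOptions buildOptions = {};
// Trade some trace performance for faster builds; this helps most when
// there are many small GASes where build overhead dominates.
buildOptions.buildFlags = OPTIX_BUILD_FLAG_PREFER_FAST_BUILD;  // was ..._PREFER_FAST_TRACE
buildOptions.operation = OPTIX_OPERATION_BUILD;
```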

I added multiple streams to your sample ([1] code at the bottom), and I see the following result with your default OPTIX_BUILD_FLAG_PREFER_FAST_TRACE:

1 stream:
GAS build time over 100 runs (ms): min=62.021 median=69.948 max=75.705

4 streams:
GAS build time over 100 runs (ms): min=13.876 median=14.221 max=15.205

Single GAS
GAS build time over 100 runs (ms): min=0.647 median=0.732 max=0.840

Please note that when I pass the flag --reuse-scratch when using multiple streams, the app crashes. This is to be expected, for the reason I gave earlier.

Here’s the result I get when using OPTIX_BUILD_FLAG_PREFER_FAST_BUILD, which provides a considerable improvement in the case with only 1 stream, and a modest improvement to multiple streams:

1 stream:
GAS build time over 100 runs (ms): min=33.657 median=46.917 max=51.578

4 streams:
GAS build time over 100 runs (ms): min=11.408 median=11.856 max=12.974

Single GAS
GAS build time over 100 runs (ms): min=0.448 median=0.501 max=0.570

On my Windows machine with a 5090 I don’t seem to see any benefit on this scene when using more than 4 streams, but that can depend on the distribution of GAS sizes (potentially as well as device model and driver version and other things). In general I’d guess that you might want to test up to 8 streams, but might not expect to see improvements beyond that.

[1] Here’s the measurement loop for multiple streams. Obviously there are a few other changes needed in the app, but they should be simple/straightforward.

for (int m = 0; m < measureCount; ++m) {
    gasBuildTimer.start(cuStream[0]);
    // Distribute the GAS builds round-robin across the available streams.
    for (int g = 0, s = 0; g < static_cast<int>(built.size()); ++g) {
        Geometry &entry = built[g];
        cudau::Buffer &scratchBuf = reuseScratch ? sharedScratch : entry.scratch;
        entry.handle = entry.gas.rebuild(cuStream[s], entry.gasMem, scratchBuf);
        s = (s + 1) % numStreams;
    }
    // Wait for every stream to finish before stopping the timer.
    for (int s = 0; s < numStreams; ++s)
        CUDADRV_CHECK(cuStreamSynchronize(cuStream[s]));
    gasBuildTimer.stop(cuStream[0]);
    buildTimesMs.push_back(gasBuildTimer.report());
}

Incidentally, I disabled the dummy kernels in your sample. I didn’t see any timing differences, and I wouldn’t expect any. I’m happy to share back my modified version if you want it.


David.

I extended my test program to specify the number of streams used to build GASes, like your snippet, and observed significant speedup up to 8 streams.

1 stream:
min=67.094 median=82.959 max=97.328
2 streams:
min=23.554 median=31.994 max=40.796
4 streams:
min=11.213 median=11.661 max=17.658
8 streams:
min=6.118 median=6.928 max=12.187
16 streams:
min=6.269 median=6.990 max=12.805

So the multiple-stream strategy is actually working,

Note that OptiX also does some batching behind the scenes when queueing up multiple GAS builds on a single stream

but I’m still wondering what the batching optimization the OptiX runtime performs internally actually is. Is it something similar to what I imagined?

So I was assuming some optimization on OptiX runtime side that creates a command sequence like the following instead:
Shader 0A - Shader 1A ... - Sync A - Shader 0B - Shader 1B ... - Sync B - Shader 0C - Shader 1C ... - Sync C - ...

With this image in my mind, I tried another option where the GAS build items are sorted by their triangle counts (in descending order), but the effect was hard to judge (sometimes it looks effective, sometimes not, because of timing fluctuations on Windows).

BTW, can you trace the program with the latest Nsight Systems?
I tried, but Nsight makes loading the OBJ data take infinitely long for some reason.

Thanks,

I observed significant speedup up to 8 streams.

Hey that’s great! My attempt might have been too simple, but yes you’re seeing better multi-stream perf with 8 streams than I did with my quick test.

I’m still wondering what the batching optimization that OptiX runtime does internally is. Is this something similar to what I imagined?

So the primary single-stream optimization in OptiX is a set of facilities for reducing the overhead of BVH launches compared to OptiX + drivers of the past. This primarily benefits small BVH builds, where the launch overhead is a larger percentage of the total BVH build time, and it allows a high throughput of small BVH builds. This set of optimizations doesn’t change the order of kernels, nor does it affect which BVH builds can or cannot run in parallel. My earlier comment here was probably a bit too strong; it’s obvious that for this example, using multiple streams is the way to go.

Aside from that, keep in mind that any given BVH build consists of multiple kernels, and that OptiX currently has no way to communicate resource hazards through the API, i.e., whether two different BVH builds share the same scratch buffer and thus cannot run in parallel. This is directly related to your original question here. We might be able to add more low-level resource barrier information to the API or merge BVH builds in the future, but without more information there’s a limit to what reordering we can do safely, so it’s currently better for users to manage multi-stream batching and resource-sharing scenarios.

With this image in my mind, I tried another option where GAS build items are sorted based on its triangle count (sorted descending), but the effect was hard to judge (Sometimes it looks effective but sometimes not, because of timing fluctuation in Windows).

Hey, I had the same thought and tried the same thing with your sample ;) sorting both increasing and decreasing. (I also tried a few variants of grouping same-sized BVH builds per stream vs. spreading them out.) I didn’t mention it in the last comment because I didn’t see any reliable improvements, and a few configurations seemed to consistently result in a few percent lower overall throughput.

can you trace the program with the latest NSIGHT Systems?

It worked for me with driver 591.74 and nsys 2025.5.2. Maybe I didn’t get stuck on OBJ loading because I did not run with admin privileges. Here are the command lines I used:

To capture the profile:

"C:\Program Files\NVIDIA Corporation\Nsight Systems 2025.5.2\target-windows-x64\nsys.exe" profile bin\RelWithDebInfo\multi_blas_from_obj.exe ..\sponza.obj --measure-count 100
WARNING: CPU context switches trace requires administrative privileges, disabling.
WARNING: CPU sampling requires administrative privileges, disabling.
Collecting data...
Generating 'E:\TEMP\nsys-report-fb8a.qdstrm'
[1/1] [========================100%] report2.nsys-rep
Generated:
        repro_batched_blas\build\report2.nsys-rep

To view the profile:

"C:\Program Files\NVIDIA Corporation\Nsight Systems 2025.5.2\host-windows-x64\nsys-ui.exe" report2.nsys-rep

Here’s what I see:

Close-up of 3 unsorted batches:

Close-up of 3 sorted batches:

Maybe collecting all the smallest 2-triangle BVH builds together stacks the highest-overhead builds back to back and causes an increase in latency, even though very little actual work is getting done near the end of the batch. These 2-triangle GASes are just too small to justify launching kernels…


David.