No, I don’t synchronize BVH builds, I just have synchronizations between measurements.
Ah, yes, sorry! I missed that detail the first time through.
So there are 2 things you can do to get better performance with many small GAS builds:
(1) use multiple streams
(2) use OPTIX_BUILD_FLAG_PREFER_FAST_BUILD. With the fast build flag, OptiX should build somewhat faster, and may also batch the builds better.
I added multiple streams to your sample ([1] code at the bottom), and I see the following result with your default OPTIX_BUILD_FLAG_PREFER_FAST_TRACE:
1 stream:
GAS build time over 100 runs (ms): min=62.021 median=69.948 max=75.705
4 streams:
GAS build time over 100 runs (ms): min=13.876 median=14.221 max=15.205
Single GAS:
GAS build time over 100 runs (ms): min=0.647 median=0.732 max=0.840
Please note that the app crashes when I pass the --reuse-scratch flag together with multiple streams. This is to be expected, for the reason I gave earlier.
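If you still want to cut down on scratch allocations with multiple streams, one option is a scratch buffer per stream instead of a single shared one. The sketch below only computes the sizes. It assumes the crash comes from builds on different streams using the same scratch memory at the same time, and that GAS g is built on stream g % numStreams, as in the round-robin loop in [1]. perStreamScratchSizes is a hypothetical helper, not something in the sample or in OptiX.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the sample or of OptiX).
// Input: the scratch size in bytes that each GAS build needs.
// Output: one scratch size per stream, large enough for every GAS assigned
// to that stream, assuming GAS g is built on stream g % numStreams.
std::vector<std::size_t> perStreamScratchSizes(
    const std::vector<std::size_t>& gasScratchSizes, int numStreams) {
    assert(numStreams > 0);
    std::vector<std::size_t> sizes(static_cast<std::size_t>(numStreams), 0);
    for (std::size_t g = 0; g < gasScratchSizes.size(); ++g) {
        const std::size_t s = g % static_cast<std::size_t>(numStreams);
        // Builds on the same stream run in order, so the largest
        // requirement on that stream is enough for all of them.
        sizes[s] = std::max(sizes[s], gasScratchSizes[g]);
    }
    return sizes;
}
```

In the build loop you would then pick the scratch buffer by the stream index s rather than using sharedScratch for everything. I have not tested this variant in the sample.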
Here’s the result I get when using OPTIX_BUILD_FLAG_PREFER_FAST_BUILD. It gives a considerable improvement with only 1 stream (median down about 33%, from 69.948 ms to 46.917 ms) and a modest improvement with 4 streams (median down about 17%, from 14.221 ms to 11.856 ms):
1 stream:
GAS build time over 100 runs (ms): min=33.657 median=46.917 max=51.578
4 streams:
GAS build time over 100 runs (ms): min=11.408 median=11.856 max=12.974
Single GAS:
GAS build time over 100 runs (ms): min=0.448 median=0.501 max=0.570
On my Windows machine with a 5090, I don’t see any benefit on this scene from using more than 4 streams, but that can depend on the distribution of GAS sizes (and potentially on device model, driver version, and other factors). In general, I’d guess it’s worth testing up to 8 streams, but I wouldn’t expect improvements beyond that.
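One way to run that test is a small sweep over candidate stream counts that only moves to a higher count when it gives a clear gain. This is a rough sketch: pickStreamCount, the measureMedianMs callback, and the 5% threshold are all assumptions for illustration. The callback would wrap the measurement loop and return the median build time for a given stream count.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: returns the candidate stream count with the best
// median, accepting a new candidate only if it beats the best so far by
// more than minGain (a fraction, e.g. 0.05 for 5%).
int pickStreamCount(const std::vector<int>& candidates,
                    const std::function<double(int)>& measureMedianMs,
                    double minGain = 0.05) {
    assert(!candidates.empty());
    int best = candidates.front();
    double bestMs = measureMedianMs(best);
    for (std::size_t i = 1; i < candidates.size(); ++i) {
        const double ms = measureMedianMs(candidates[i]);
        if (ms < bestMs * (1.0 - minGain)) {
            best = candidates[i];
            bestMs = ms;
        }
    }
    return best;
}
```

Calling it with candidates such as {1, 2, 4, 8} would match the range suggested above. The threshold keeps run-to-run noise from pushing you to more streams than you need.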
[1] Here’s the measurement loop for multiple streams. Obviously there are a few other changes needed in the app, but they should be simple/straightforward.
    for (int m = 0; m < measureCount; ++m) {
        gasBuildTimer.start(cuStream[0]);
        // Issue the GAS builds round-robin across the streams.
        for (int g = 0, s = 0; g < static_cast<int>(built.size()); ++g) {
            Geometry &entry = built[g];
            // Note: sharing one scratch buffer is not safe with multiple streams.
            cudau::Buffer &scratchBuf = reuseScratch ? sharedScratch : entry.scratch;
            entry.handle = entry.gas.rebuild(cuStream[s], entry.gasMem, scratchBuf);
            s = (s + 1) % numStreams;
        }
        // Wait for every stream before stopping the timer, so the interval covers all builds.
        for (int s = 0; s < numStreams; ++s)
            CUDADRV_CHECK(cuStreamSynchronize(cuStream[s]));
        gasBuildTimer.stop(cuStream[0]);
        buildTimesMs.push_back(gasBuildTimer.report());
    }
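For reference, the min/median/max lines above can be produced from buildTimesMs with something like the following. It is a minimal sketch and not necessarily how the sample computes them: TimingSummary, summarize, and formatSummary are hypothetical names, the samples are assumed to be convertible to double, and for an even number of runs the median is taken as the mean of the two middle values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical summary type for a set of timing samples in milliseconds.
struct TimingSummary {
    double minMs;
    double medianMs;
    double maxMs;
};

// Takes the samples by value so the caller's vector is left unsorted.
TimingSummary summarize(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    // Assumption: with an even count, average the two middle samples.
    const double median = (n % 2 == 1)
        ? samples[n / 2]
        : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
    return {samples.front(), median, samples.back()};
}

// Formats the summary in the same layout as the result lines in this post.
std::string formatSummary(const TimingSummary& t, std::size_t runs) {
    char buf[160];
    std::snprintf(buf, sizeof(buf),
                  "GAS build time over %zu runs (ms): min=%.3f median=%.3f max=%.3f",
                  runs, t.minMs, t.medianMs, t.maxMs);
    return std::string(buf);
}
```

The median is the more useful number to compare across stream counts, since the min and max are sensitive to single outlier runs.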
Incidentally, I disabled the dummy kernels in your sample. I didn’t see any timing differences, and I wouldn’t expect any. I’m happy to share back my modified version if you want it.
–
David.