Compaction with BVH update

Hey folks.

I was hoping to use acceleration structure compaction together with the UPDATE functionality. I have a large scene (many meshes and curves) where refitting is possible. If I build in an async fashion, the temp-buffer overheads are such that my 24 GB GPU runs out of memory, but if I synchronize (and thus throw away each temp buffer), I can successfully render with the UPDATEd BVH.

Async provides a big win on my other scenes, and I’d like to be able to employ it on the larger scene also. I saw that by using compaction I can save something like 20% of memory, which would likely be enough. Unfortunately, I then cannot UPDATE directly into the existing compacted BVH (I get an illegal memory access), meaning I’d actually have EXTRA memory overhead to maintain a compacted BVH for UPDATE candidates.

Is there something I’m missing, or would it be fair to say that not much (anything?) is to be gained by using UPDATE and COMPACTION together?

Unfortunately, I then cannot UPDATE directly into the existing compacted BVH (I get an illegal memory access)

Updating a compacted BVH is generally supported by OptiX.
https://raytracing-docs.nvidia.com/optix7/guide/index.html#acceleration_structures#6217
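
For reference, requesting both capabilities and querying the corresponding buffer sizes looks roughly like this (a sketch; context and buildInput are assumed to exist already):

```cpp
OptixAccelBuildOptions accelOptions = {};
accelOptions.buildFlags = OPTIX_BUILD_FLAG_ALLOW_UPDATE |
                          OPTIX_BUILD_FLAG_ALLOW_COMPACTION;
accelOptions.operation  = OPTIX_BUILD_OPERATION_BUILD;

OptixAccelBufferSizes bufferSizes = {};
optixAccelComputeMemoryUsage(context, &accelOptions, &buildInput, 1, &bufferSizes);
// bufferSizes.outputSizeInBytes     -> size of the uncompacted BVH
// bufferSizes.tempSizeInBytes       -> scratch space for the initial build
// bufferSizes.tempUpdateSizeInBytes -> scratch space for later updates
```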

If that crashes with an illegal memory access and not some out-of-memory issue, could you please first try updating to the most recent display drivers to see if that resolves the error?
If not and you’re not on the most recent OptiX SDK 7.4.0 version, update to that and repeat the experiment.

If the error persists with the newest SDK and display drivers, would you be able to provide a minimal and complete reproducer to be able to file that as bug report for analysis?

Please include the following system configuration information when reporting issues:
OS version, installed GPU(s), VRAM amount, display driver version, OptiX (major.minor.micro) version, CUDA toolkit version (major.minor) used to generate the input PTX, host compiler version.

Is there something I’m missing, or would it be fair to say that not much (anything?) is to be gained by using UPDATE and COMPACTION together?

The resulting size of the compacted BVH heavily depends on the number of primitives, their spatial orientation, and especially on the GPU architecture. BVH compaction can result in quite dramatic size reductions compared to the memory required for the original build. I’ve seen savings of a lot more than 20% on RTX boards, though that was without the update flag:
https://forums.developer.nvidia.com/t/is-it-possible-to-call-optixtrace-from-custom-intersection/165252/5

https://raytracing-docs.nvidia.com/optix7/guide/index.html#acceleration_structures#dynamic-updates

Thanks for the response, Detlef. So just checking… I should be able to do the following:

  1. Build the BVH with UPDATE and COMPACTION as options using T temp memory and A acceleration structure memory
  2. Compactify the BVH into C acceleration structure memory, throwing away A
  3. Update the compacted BVH in C with T temp memory
  4. Update BVH in C with T
  5. Update BVH in C with T…

Basically the question comes down to whether I still need the A memory to do updates of a compacted BVH, or whether the update happens in-place in C with T temp memory?
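
Spelled out in code, I’m imagining roughly the following (a sketch with made-up buffer names; error checks and the final frees omitted):

```cpp
// Step 1: build with both flags, asking OptiX to emit the compacted size.
OptixAccelBuildOptions accelOptions = {};
accelOptions.buildFlags = OPTIX_BUILD_FLAG_ALLOW_UPDATE |
                          OPTIX_BUILD_FLAG_ALLOW_COMPACTION;
accelOptions.operation  = OPTIX_BUILD_OPERATION_BUILD;

OptixAccelEmitDesc emit = {};
emit.type   = OPTIX_PROPERTY_TYPE_COMPACTED_SIZE;
emit.result = d_compactedSizePtr; // device pointer to a single size_t

OptixTraversableHandle handle = 0;
optixAccelBuild(context, stream, &accelOptions, &buildInput, 1,
                d_temp, sizes.tempSizeInBytes,      // T
                d_outputA, sizes.outputSizeInBytes, // A
                &handle, &emit, 1);

// Step 2: compact A into C; afterwards A can be freed.
cudaStreamSynchronize(stream); // the build must be done before reading the size
size_t compactedSize = 0;
cudaMemcpy(&compactedSize, reinterpret_cast<void*>(d_compactedSizePtr),
           sizeof(size_t), cudaMemcpyDeviceToHost);
CUdeviceptr d_outputC = 0;
cudaMalloc(reinterpret_cast<void**>(&d_outputC), compactedSize);
optixAccelCompact(context, stream, handle, d_outputC, compactedSize, &handle);
cudaStreamSynchronize(stream); // A is still read during compaction
cudaFree(reinterpret_cast<void*>(d_outputA));

// Steps 3-5: refit in place in C, reusing T (same topology, new vertex data).
accelOptions.operation = OPTIX_BUILD_OPERATION_UPDATE;
optixAccelBuild(context, stream, &accelOptions, &buildInput, 1,
                d_temp, sizes.tempUpdateSizeInBytes,
                d_outputC, compactedSize,
                &handle, nullptr, 0);
```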

Another strange thing I’ve encountered is that T seems to be reported the same whether I’m BUILDing or UPDATEing, but the docs indicate that less temp memory should be needed for updates?

I’ve tried this on 495 and 510 drivers with OptiX 7.4.

If my logic for the compaction and refitting above is correct, then yes, I’ll try to boil this down to a standalone repro.

Yes, that is correct: steps 1 + 2, then repeat step 3 for each update.

The OptiX SDK 7.4.0 example optixDynamicGeometry uses OPTIX_BUILD_FLAG_ALLOW_COMPACTION | OPTIX_BUILD_FLAG_ALLOW_UPDATE build flags.

My own example code does steps 1 and 2 here, just without the update flag.
https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/nvlink_shared/src/Device.cpp#L1199

Note the explicit synchronization steps in there: that specific example uses my own simple single-threaded arena allocator, which is not synchronized to the streams because it usually doesn’t call cuMemAlloc or cuMemFree per allocation!

optixAccelBuild() is one of the API calls that take a CUDA stream argument, which makes it asynchronous.

You said you’re using asynchronous builds for performance reasons. Make sure you’re not freeing or reusing the allocations while the builder kernel is still using the memory. If adding cuStreamSynchronize() or cudaStreamSynchronize() calls after each of your optixAccelBuild() calls fixes the crashes, you have a synchronization issue.
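
For example (a sketch; the buffer names stand in for whatever your allocator returned):

```cpp
// Debug sketch: force each build to finish before its buffers can be freed
// or recycled. If this makes the crash go away, the asynchronous path frees
// or reuses memory while the builder kernel is still running.
optixAccelBuild(context, stream, &accelOptions, &buildInput, 1,
                d_temp, tempSizeInBytes,
                d_output, outputSizeInBytes,
                &handle, nullptr, 0);
cudaStreamSynchronize(stream);
```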

EDIT: The last link I gave above describes all cases where an update doesn’t work. It’s more than just the topology.

My intro_motion_blur example shows the update of instance acceleration structures when changing linear or SRT motion matrices inside the transform hierarchy.
Build: https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/intro_motion_blur/src/Application.cpp#L1784
Update: https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/intro_motion_blur/src/Application.cpp#L2486
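
Schematically, that IAS refit boils down to something like this (a sketch with placeholder names, assuming the OptixInstance array on the device has already been rewritten with the new transforms):

```cpp
// Sketch: refit an existing IAS in place after its instance transforms changed.
OptixBuildInput iasInput = {};
iasInput.type = OPTIX_BUILD_INPUT_TYPE_INSTANCES;
iasInput.instanceArray.instances    = d_instances;  // updated OptixInstance array
iasInput.instanceArray.numInstances = numInstances; // must not change for UPDATE

iasOptions.operation = OPTIX_BUILD_OPERATION_UPDATE; // built with ALLOW_UPDATE
optixAccelBuild(context, stream, &iasOptions, &iasInput, 1,
                d_iasTemp, iasSizes.tempUpdateSizeInBytes,
                d_ias, iasSizes.outputSizeInBytes,
                &iasHandle, nullptr, 0);
```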

Thanks Detlef. The topology is definitely fixed in the update cases, and I’m attempting to make sure that allocations stay alive as long as they are used, but I’ll double-check whether this still crashes with cudaStreamSynchronize(). I’ll take a look at your example and see if I’m doing anything differently. If it still crashes after all that, I’ll try to make a minimal repro.

Another strange thing I’ve encountered is that T seems to be reported the same whether I’m BUILDing or UPDATEing, but the docs indicate that less temp memory should be needed for updates?

What’s your GPU?
About what GAS sizes are we talking? (Number of primitives and required scratch space?)

This was on a TITAN RTX 24 GB. When I find some time I can test on a 3090 also. The GAS build sizes range from small to large, e.g. from a few KB to more than 1 GB. I found it very strange that the temp buffer sizes would be the same for the build and update operations for every single object.

Actually, on my 3090 with 495 drivers, the temp buffer sizes differ significantly between the build and update cases. I’ll have to double-check on my other machine that I didn’t write a bug in my debug code. Still chasing down the crash-after-compact issue.

Hi Brian,

Just a tiny side note that you can set the env var CUDA_LAUNCH_BLOCKING to 1 in order to debug-simulate having a synchronize after all of your launches. This might be handy to avoid changing & rebuilding code as often. If the crashing goes away when this var is set, it tends to indicate a synchronization issue somewhere. If the crash still happens, then it might be more likely to be a bad buffer size leading to an out of bounds error, or some mixed up initialization options, or something else…

Programming Guide :: CUDA Toolkit Documentation
https://docs.nvidia.com/cuda/cuda-c-programming-guide/

Also, a reminder to turn on OptiX validation mode as you debug, just in case it has something helpful to say.
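
Both debug aids together might look roughly like this (a sketch; cuCtx stands for your current CUDA context, and validationMode requires a recent OptiX 7 SDK):

```cpp
#include <cstdlib> // setenv

// Debug sketch: simulate synchronous launches and enable OptiX validation.
// CUDA_LAUNCH_BLOCKING is only read when the CUDA context is created, so it
// must be set before any CUDA call (setting it in the shell works as well).
setenv("CUDA_LAUNCH_BLOCKING", "1", 1); // POSIX; use _putenv_s() on Windows

OptixDeviceContextOptions options = {};
options.validationMode = OPTIX_DEVICE_CONTEXT_VALIDATION_MODE_ALL;

OptixDeviceContext context = nullptr;
optixDeviceContextCreate(cuCtx, &options, &context);
```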


David.

Thanks for all your help, Detlef and David. I found a bug in my code. Just knowing that what I was trying is an intended use case was very helpful!

As an aside, the SDK samples don’t have a case with UPDATE | COMPACTION that uses indexed triangles (they use just a vertex buffer), nor is there an example with curves; both were places where my app deviated from the samples.
