BVH updates slow on RTX 4090 + PRO 6000 Blackwell

I am currently benchmarking triangle BVH updates in OptiX and noticed unusual performance on two systems.

The benchmark creates 2^26 small triangles (surface area around 1) at distinct integer locations and builds a BVH over them (with the update flag set), then moves four of the triangles in the vertex buffer to a different 3D location, and finally updates the BVH with all 2^26 triangles again. I am well aware that these non-local updates will degrade the structure of the BVH and can increase tracing time, but I am interested only in the time it takes to re-run optixAccelBuild, not in the tracing afterwards.
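For context, the build-then-refit pattern I'm using looks roughly like this (a minimal sketch; device buffer allocation, the triangle OptixBuildInput setup, and error checking are omitted, and names like d_temp and d_output are placeholders):

```cuda
// Build once with the update flag set, so the BVH can be refit later.
OptixAccelBuildOptions accelOptions = {};
accelOptions.buildFlags = OPTIX_BUILD_FLAG_ALLOW_UPDATE;
accelOptions.operation  = OPTIX_BUILD_OPERATION_BUILD;

OptixAccelBufferSizes bufferSizes = {};
optixAccelComputeMemoryUsage(context, &accelOptions, &buildInput, 1, &bufferSizes);

OptixTraversableHandle handle = 0;
optixAccelBuild(context, stream, &accelOptions, &buildInput, 1,
                d_temp,   bufferSizes.tempSizeInBytes,
                d_output, bufferSizes.outputSizeInBytes,
                &handle, nullptr, 0);

// ... move four triangles in the device-side vertex buffer ...

// Refit: the same call, with the operation switched to UPDATE.
accelOptions.operation = OPTIX_BUILD_OPERATION_UPDATE;
optixAccelBuild(context, stream, &accelOptions, &buildInput, 1,
                d_temp,   bufferSizes.tempUpdateSizeInBytes,
                d_output, bufferSizes.outputSizeInBytes,
                &handle, nullptr, 0);
```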

One system contains an RTX 4090 with CUDA 13 and driver 580.95.05. Building the BVH (around 4.2 GB) takes 116 ms, and updating takes 9099 ms. I remember running the same experiment a few years back with updates taking around one third of the build time, which makes more sense.

Running identical code on a different system with an RTX PRO 6000 Blackwell with CUDA 13 and driver 580.95.05 builds in 72 ms, and takes 6443 ms for the update. Same behavior here.

Do you have any ideas on why that might happen? Any settings I can check? Thanks!

Hi @slyphix, intriguing question.

Off the top of my head, I have no idea why this might happen. I will ask some experts on the BVH team for their thoughts and reply again when I hear back.

In the meantime, some questions:

These machines are both running Linux, correct?

How are you measuring the time? Do you synchronize both before and after building to isolate the BVH update? Are you using CUDA events for the timing?

Do you see normal update times on some systems? I would expect a normal update/refit to take less time than a rebuild, as you saw before.

Is the long update behavior consistent? Does it always happen if you update multiple times in a row? Does the long update continue happening if you re-run immediately after a reboot?

Are these systems connected to a display, or are they headless, or do they use separate display GPUs?

I’m asking these things in part because the one thing that comes to mind that sounds possibly similar is a report I heard of a machine randomly stalling during CUDA calls such as cudaMemcpy() after at least several days of uptime and a lot of activity. I don’t know if the cause has been tracked down, but I suspect it could be memory fragmentation; I’m told the issue goes away for a while after a reboot.


David.

Hi David,

Thanks for responding so quickly.

Both machines are running AlmaLinux 10.1, and both are headless. I am measuring the execution time of optixAccelBuild using CUDA events on the default stream. I saw the expected results only a few days ago on the 4090 system, but the behavior has changed since then. The admin told me the only relevant event between the two tests was a reboot of the machine (which seems to indicate that another reboot will not help). I am unsure whether the reboot activated a new (buggy) driver version.
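Concretely, the measurement looks roughly like this (a sketch; error checking omitted, default stream assumed, buffer names are placeholders):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaDeviceSynchronize();   // make sure no earlier work overlaps the measurement
cudaEventRecord(start);    // default stream
optixAccelBuild(context, /*stream=*/0, &accelOptions, &buildInput, 1,
                d_temp,   bufferSizes.tempUpdateSizeInBytes,
                d_output, bufferSizes.outputSizeInBytes,
                &handle, nullptr, 0);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("optixAccelBuild (update): %.1f ms\n", ms);
```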

Currently, the update is only done once during the lifetime of the executable. However, the behavior is consistent across restarts of the executable, and across “update sizes”, i.e., when I move more triangles before the update, the update exhibits slightly worse performance. Build performance remains good and stable. I’ll run some additional tests and report back. Thanks!

Thanks for the added details. It occurs to me that a synthetic procedural scene like this might be triggering some kind of bad corner-case behavior.

In case this is a bug that needs to be filed and fixed, one thing we could discuss is how to reproduce it on our end. Is there code you can share? Or would it be easy to modify an SDK sample to reproduce it? Alternatively, if you could save out the scene as a .gltf file, or share code to regenerate the scene, it might be easy for us to check.

It also might be worth trying different drivers, if possible. It sounds like you don’t have admin access to the machines, but if it’s viable, checking an older driver, or a new 590 driver, might uncover a version dependency.


David.

I tried running the same update multiple times during the lifetime of the executable. All updates but the first now exhibit good performance. I also tried running different updates of the same size in sequence, and again, only the first update operation overall was slow; the others were fast. This might be initialization overhead, but it is neither the first CUDA call nor the first OptiX call in the program, and not even the first optixAccelBuild call.

For my benchmark, I’ll just run a dummy update first and ignore its run time, which is fine for me. But if I had to develop a real-time application, this would be bad. I remember a similar case where the first call to optixTrace turned out to be super expensive, probably for a similar reason. I would love to have a way to force this initialization to happen explicitly, so I can run it at a non-critical time. I would appreciate it if you could relay this to the OptiX team.
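In other words, the workaround is a throwaway warm-up refit right after the initial build, along these lines (a sketch; buffer names are placeholders):

```cuda
// Warm-up: perform one update immediately after the initial build and
// discard its timing, so the one-time cost of the first update lands
// outside the benchmarked region.
accelOptions.operation = OPTIX_BUILD_OPERATION_UPDATE;
optixAccelBuild(context, stream, &accelOptions, &buildInput, 1,
                d_temp,   bufferSizes.tempUpdateSizeInBytes,
                d_output, bufferSizes.outputSizeInBytes,
                &handle, nullptr, 0);
cudaStreamSynchronize(stream);  // the slow first update is absorbed here

// ... all subsequent (timed) updates now run at normal speed ...
```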

I can provide code to reproduce the issue. It initially started from the OptiX samples but has expanded significantly over time, so it will not be nice to run or look at.

In any case, thanks a lot for taking the time to respond.

Small addition: I found an old forum post where I had a similar problem with optixAccelBuild. I have never managed to resolve this problem either: optixAccelBuild sometimes takes very long

Ha! I also forgot about that conversation we had previously.

I am interested to find out, and to get a reproducer if possible. If you’re willing to share it, we’ll take a look.

After re-reading that previous thread, another thing comes to mind: CUDA will allocate more local memory if needed before launching a big kernel, and it’s possible for this allocation to take a long time in some situations (though it’s not normal). This might not be the issue you’re seeing, but there’s a small chance, and if it were happening it could explain the symptoms.


David.

Another thought… how are your 2^26 triangles arranged spatially? Are they in a near-cube-shaped volume, or spread out along an axis (1D) or a plane (2D)? Are they separated by 1 unit, or by random amounts? What’s the approximate scene bounding box? Are the triangles distributed uniformly?

If the scale of the scene gets too large, or if the distribution of triangles somehow gets too lopsided, I can imagine reasons that might cause the refit builder to suddenly freak out and become slow compared to a smaller scene or fewer triangles.


David.