Updating AS leads to extremely low traversal performance

100,000 spheres are built into an AS in 3D space. A ray is launched from each sphere’s center, so 100,000 rays in total. tmin = 0 and tmax is a minimal value, so the ray is essentially a point. Once a ray has intersected 100 spheres, it is terminated. The total traversal time is recorded and is less than 1 ms.
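To make the setup concrete, here is a simplified sketch of what the raygen program does (identifiers such as params.centers and neighbor_counts are placeholders, not my exact code):

// Simplified sketch (placeholder identifiers, not the exact code):
// one launch index per sphere, ray origin = that sphere's center,
// tmin = 0 and a tiny tmax so the ray is effectively a point query.
#include <optix.h>
#include <cfloat>

struct Params   // hypothetical launch parameters
{
    OptixTraversableHandle handle;
    const float3*          centers;          // the 100,000 sphere centers
    unsigned int*          neighbor_counts;  // result per ray
    unsigned int           ray_origin_num;
};
extern "C" __constant__ Params params;

extern "C" __global__ void __raygen__count_neighbors()
{
    const unsigned int idx = optixGetLaunchIndex().x;
    const float3 origin    = params.centers[idx];      // ray starts at the sphere center
    const float3 direction = { 1.0f, 0.0f, 0.0f };     // direction hardly matters for a point query

    unsigned int hit_count = 0;                         // payload 0: number of spheres hit
    optixTrace( params.handle, origin, direction,
                0.0f,                                   // tmin
                FLT_MIN,                                // tmax: tiny, so the ray is "like a point"
                0.0f, OptixVisibilityMask( 255 ), OPTIX_RAY_FLAG_NONE,
                0, 1, 0,                                // SBT offset, stride, miss index
                hit_count );
    params.neighbor_counts[idx] = hit_count;            // termination at 100 hits happens in the hit programs
}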

Then I update 5,000 spheres (5%) in the AS and again launch a ray from each sphere’s center. The traversal time now exceeds 100 ms.

I know that updating an AS can degrade performance, but in this scenario the drop is very large. Is that normal?

Thanks in advance.

This is not enough information to say what is going on.

Please provide your system configuration on all OptiX questions to reduce turnaround times:
OS version, installed GPU(s), VRAM amount, display driver version, OptiX major.minor.micro version, CUDA toolkit version used to generate the input PTX, host compiler version.

100,000 primitives and 100,000 rays are low numbers for a modern high-end GPU.

Updating an acceleration structure can degrade its quality. The farther primitives are moved away from their initial locations during that update, the less optimal the updated AS can become.
Still, I wouldn’t expect a 100-fold runtime performance change from updating just 5% of an AS with built-in sphere primitives.

tmin = 0 and tmax is a minimal value, so the ray is essentially a point.

I don’t understand how that is supposed to intersect anything.
Could you please give the absolute values for your tmax, the spheres’ radii, and the whole scene extent?

How does the result change when you’re not updating the AS but rebuilding it instead?
If the AS update results in a worse AS structure, then the rebuild should show different runtime behavior.
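For reference, switching between the two should only be a matter of the build operation, roughly like this (a sketch; it assumes the AS was originally built with OPTIX_BUILD_FLAG_ALLOW_UPDATE and that the buffers were sized via optixAccelComputeMemoryUsage):

// Sketch: toggling between a full rebuild and a refit of the same GAS.
// An update only needs tempUpdateSizeInBytes for the temp buffer.
OptixAccelBuildOptions accel_options = {};
accel_options.buildFlags = OPTIX_BUILD_FLAG_ALLOW_UPDATE | OPTIX_BUILD_FLAG_PREFER_FAST_TRACE;
accel_options.operation  = rebuild ? OPTIX_BUILD_OPERATION_BUILD
                                   : OPTIX_BUILD_OPERATION_UPDATE;

OPTIX_CHECK( optixAccelBuild( context, stream, &accel_options,
                              &build_input, 1,        // build input references the updated sphere buffers
                              d_temp, temp_size,
                              d_output, output_size,
                              &gas_handle,
                              nullptr, 0 ) );          // no emitted properties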

How are you measuring the “traversal time” exactly? Show some code.
The optixAccelBuild and the optixLaunch are both asynchronous calls.
In your benchmarking code, did you synchronize correctly between these to not include the actual optixAccelBuild time in your optixLaunch measurement?
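For example, one way to isolate the launch time is to bracket only the optixLaunch with CUDA events on the same stream (a sketch, error checking omitted for brevity):

// Sketch: measure only the optixLaunch, excluding the asynchronous AS build.
cudaEvent_t start, stop;
cudaEventCreate( &start );
cudaEventCreate( &stop );

cudaStreamSynchronize( stream );            // make sure optixAccelBuild has finished

cudaEventRecord( start, stream );
optixLaunch( pipeline, stream, d_params, sizeof( Params ), &sbt, width, 1, 1 );
cudaEventRecord( stop, stream );
cudaEventSynchronize( stop );

float ms = 0.0f;
cudaEventElapsedTime( &ms, start, stop );   // traversal time in milliseconds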

Are you benchmarking in full release mode, i.e. all input source compiled without debug information, OptixModuleCompileOptions optimization level on full and debug level on none or minimal, and OptiX validation mode off?
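That means roughly these settings (a sketch):

// Sketch: release-style settings for benchmarking.
OptixModuleCompileOptions module_compile_options = {};
module_compile_options.optLevel   = OPTIX_COMPILE_OPTIMIZATION_LEVEL_3;   // full optimization
module_compile_options.debugLevel = OPTIX_COMPILE_DEBUG_LEVEL_NONE;       // or OPTIX_COMPILE_DEBUG_LEVEL_MINIMAL

OptixDeviceContextOptions context_options = {};
context_options.validationMode = OPTIX_DEVICE_CONTEXT_VALIDATION_MODE_OFF; // no validation overhead

// Plus: generate the input PTX without -G or -lineinfo, i.e. an optimized release build.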

Thanks for the quick reply! The following is the system configuration:

OS version: Ubuntu 20.04.6 LTS
installed GPU(s): NVIDIA GeForce RTX 3090
VRAM amount: 24 GB
display driver version: 530.30.02
OptiX version: 7.5.0
CUDA toolkit version: 11.5
host compiler version: gcc 10.5.0, g++ 10.5.0

tmax = FLT_MIN in C, which is about 1.175e-38. The following figure shows the scene. I create spheres centered on points such as ray_a, ray_b and ray_c, and these spheres are built into the AS. An extremely short ray is cast from each point, ray_a, ray_b and ray_c. The ray must intersect the sphere whose center is the ray’s origin, because the ray’s origin lies inside that sphere. Moreover, the ray will intersect every sphere whose center lies within a distance R (the sphere radius) of the ray’s origin (ray_a intersects sphere B and ray_b intersects sphere A). The spheres’ radii ensure that each of over 99,000 rays can intersect 100 spheres.
[figure: scene layout with spheres centered on points ray_a, ray_b and ray_c]

Rebuilding the BVH can achieve similar performance to the initial state, taking less than 1 ms.

This is how I measure the traversal time:

// rebuild or update the BVH ...
cudaDeviceSynchronize();

struct timeval t1;                  // gettimeofday() from <sys/time.h>
gettimeofday(&t1, NULL);
CUDA_CHECK(cudaMemcpy(
        reinterpret_cast<void *>(state.d_params),
        &state.params,
        sizeof(Params),
        cudaMemcpyHostToDevice));

OPTIX_CHECK(optixLaunch(state.pipeline, 0, state.d_params, sizeof(Params), &state.sbt, state.params.ray_origin_num, 1, 1));
CUDA_SYNC_CHECK();
struct timeval t2;
gettimeofday(&t2, NULL);
double total_time = (t2.tv_sec - t1.tv_sec) * 1000.0 + (t2.tv_usec - t1.tv_usec) / 1000.0;  // milliseconds

Yes, this is how I set it up.

The ray must intersect the sphere whose center is the ray’s origin, because the ray’s origin lies inside that sphere.

Then I’m assuming you’re using custom primitives with your own sphere intersection program?

tmax == FLT_MIN would not allow shrinking the ray interval during intersection calculations, except once down to 0.0f. This is a rather weird setup.
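For reference, such an intersection program would essentially be a point-in-sphere test against the ray origin, something like this (a sketch with hypothetical names):

// Sketch (hypothetical names): custom-primitive intersection program that reports a
// hit whenever the ray origin lies inside the sphere. With tmin = 0, reporting t = 0
// stays inside the valid [tmin, tmax] interval even for a tiny tmax.
#include <optix.h>

struct Params { const float3* centers; float radius; };  // hypothetical launch parameters
extern "C" __constant__ Params params;

extern "C" __global__ void __intersection__sphere_contains_origin()
{
    const unsigned int prim = optixGetPrimitiveIndex();
    const float3 center     = params.centers[prim];
    const float3 origin     = optixGetWorldRayOrigin();

    const float dx = origin.x - center.x;
    const float dy = origin.y - center.y;
    const float dz = origin.z - center.z;

    if( dx * dx + dy * dy + dz * dz < params.radius * params.radius )
        optixReportIntersection( 0.0f, 0 );   // hit kind 0; no attributes needed here
}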

The spheres’ radii ensure that each of over 99,000 rays can intersect 100 spheres.

That means at least 100 spheres are overlapping each ray origin.
That is going to be a really awful acceleration structure to begin with.

Rebuilding the BVH can achieve similar performance to the initial state, taking less than 1 ms.

OK, so you’re saying the updated AS is the problem.
No idea how to solve that with the given information.
Maybe always rebuild it? That shouldn’t take too long for 100,000 primitives either.

Why do you need to solve this with ray tracing?
Calculating overlap of spheres and points could easily be solved with a native CUDA program and some simple spatial grid to reduce the number of checks.
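For example, a plain CUDA/Thrust sketch of that idea (all names hypothetical): bin the points into a uniform grid with cell size R, sort the point indices by cell, then count neighbors by checking only the 27 surrounding cells.

// Sketch (hypothetical names): fixed-radius neighbor counting in plain CUDA + Thrust.
// Grid setup (mn = scene min corner, cell = R, dims from the scene extent) is up to the caller.
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/sequence.h>
#include <thrust/binary_search.h>
#include <thrust/execution_policy.h>

struct Grid { float3 mn; float cell; int3 dims; };

__host__ __device__ inline int cellId( const Grid& g, int cx, int cy, int cz )
{
    return ( cz * g.dims.y + cy ) * g.dims.x + cx;
}

// One cell key per point.
__global__ void cellKeys( const float3* p, int n, Grid g, int* key )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if( i >= n ) return;
    key[i] = cellId( g, (int)( ( p[i].x - g.mn.x ) / g.cell ),
                        (int)( ( p[i].y - g.mn.y ) / g.cell ),
                        (int)( ( p[i].z - g.mn.z ) / g.cell ) );
}

// Count, for each point, how many points lie within distance sqrt(r2).
__global__ void countNeighbors( const float3* p, int n, Grid g,
                                const int* key, const int* id, float r2, int* count )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if( i >= n ) return;
    const float3 q = p[i];
    const int cx = (int)( ( q.x - g.mn.x ) / g.cell );
    const int cy = (int)( ( q.y - g.mn.y ) / g.cell );
    const int cz = (int)( ( q.z - g.mn.z ) / g.cell );
    int c = 0;
    for( int dz = -1; dz <= 1; ++dz )
    for( int dy = -1; dy <= 1; ++dy )
    for( int dx = -1; dx <= 1; ++dx )
    {
        const int nx = cx + dx, ny = cy + dy, nz = cz + dz;
        if( nx < 0 || ny < 0 || nz < 0 || nx >= g.dims.x || ny >= g.dims.y || nz >= g.dims.z )
            continue;
        const int cell = cellId( g, nx, ny, nz );
        // Binary search the sorted keys for the contiguous range of points in this cell.
        const int* lo = thrust::lower_bound( thrust::seq, key, key + n, cell );
        const int* hi = thrust::upper_bound( thrust::seq, key, key + n, cell );
        for( const int* k = lo; k != hi; ++k )
        {
            const float3 o  = p[ id[ k - key ] ];
            const float ddx = o.x - q.x, ddy = o.y - q.y, ddz = o.z - q.z;
            if( ddx * ddx + ddy * ddy + ddz * ddz < r2 )
                ++c;
        }
    }
    count[i] = c;   // includes the point itself; subtract 1 if self-counting is unwanted
}

// Host-side driver: compute keys, sort point ids by key, then count.
void countAllNeighbors( const thrust::device_vector<float3>& pts, Grid g, float R,
                        thrust::device_vector<int>& counts )
{
    const int n = (int)pts.size();
    thrust::device_vector<int> keys( n ), ids( n );
    counts.resize( n );
    thrust::sequence( ids.begin(), ids.end() );

    const int block = 256, blocks = ( n + block - 1 ) / block;
    cellKeys<<<blocks, block>>>( thrust::raw_pointer_cast( pts.data() ), n, g,
                                 thrust::raw_pointer_cast( keys.data() ) );
    thrust::sort_by_key( keys.begin(), keys.end(), ids.begin() );
    countNeighbors<<<blocks, block>>>( thrust::raw_pointer_cast( pts.data() ), n, g,
                                       thrust::raw_pointer_cast( keys.data() ),
                                       thrust::raw_pointer_cast( ids.data() ),
                                       R * R,
                                       thrust::raw_pointer_cast( counts.data() ) );
}

Choosing the cell size equal to R guarantees that every neighbor within distance R falls into one of the 27 visited cells.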

And actually what is this meant to solve at all? Maybe there is a better way.
What is your result data?

Yes!

In this way, the spheres that the ray intersects all contain the ray’s origin, meaning their centers lie within a distance R of the ray’s origin. Using ray tracing, we can efficiently determine which points (sphere centers) lie within a distance less than R of a given point (the ray’s origin), helping to solve the KNN (k-nearest neighbor) problem.

Yes, now I always rebuild the AS instead.

Finally, the points that have over 100 neighbors (within a distance less than R) are returned. Whether I rebuild or update the AS, the results are the same, but the time spent differs. The time to rebuild the AS is acceptable now, but updating the AS takes less time. If only updating the AS could also give good traversal performance.
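Conceptually, the per-ray counting works like this anyhit program (a sketch with placeholder names, not my exact code):

// Sketch (placeholder names): anyhit program that counts overlapping spheres in
// payload 0 and stops traversal once 100 neighbors have been found.
#include <optix.h>

extern "C" __global__ void __anyhit__count_and_stop()
{
    const unsigned int count = optixGetPayload_0() + 1;  // one more sphere contains this origin
    optixSetPayload_0( count );                          // payload changes survive optixIgnoreIntersection()

    if( count >= 100 )
        optixTerminateRay();        // enough neighbors, end traversal for this ray
    else
        optixIgnoreIntersection();  // keep traversing to find more overlapping spheres
}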

If only updating the AS could also give good traversal performance.

Hi @novice, FWIW, it is possible to get good traversal performance and use AS updates, but you have to pay attention to how far things move and be careful. The AS update feature is really designed for typical animation, where things move coherently and slowly relative to the whole scene on a per-frame time scale. AS update is not meant to be a fast rebuild when things in your scene change dramatically.

If I understand the thread so far, it sounds like your update of 5% of the spheres might involve dramatic position changes? If you are, for example, choosing random locations for these 5% of spheres, then you might be degrading the AS in a very, very extreme way. Let me illustrate: if your built AS produces a structure where you typically have 100 spheres overlapping, and you then choose a new random position for 5000 of the spheres, statistically half of them will end up on the other side of the centroid of your scene from their current positions. Their bounding boxes will contain neighbors from the original AS build, so moving them will stretch those bounding boxes out to contain the new random position, and they will suddenly overlap a huge swath of the scene. Because primitives are grouped together in the AS leaf nodes, you may actually be ballooning the size of not just 5% of the leaf bounds, but anywhere from 10% to 25% or even perhaps 50% of all your leaf boxes; it really depends on how much grouping is in the original AS build.

So I might estimate that if you have 100 overlaps in the built AS and then refit after choosing 5000 random positions, you could end up with ~2,500 + 100 overlaps, meaning your new AS might have roughly 26 times worse traversal performance than before, especially in the center of your scene, under the imaginary assumptions I chose here. In reality it might not be that bad, or it might be even worse than my example. Does this make sense?

Okay, so given that, one thing you could try, if you want, is sorting your newly chosen positions so they correspond 1:1 with the closest sphere that’s being deleted or replaced. If you have 100 overlaps on average before the AS update, and you choose 5% new locations, I think that means you could in theory move each of the 5% of new spheres less than 1 sphere radius? Perhaps you can find a way to do this in less time than the difference between an AS build and an AS update. Keep in mind that if you repeat this process, you will need to rebuild your AS eventually, and if you replaced a sphere multiple times, you would probably want to redo the sorting process against the built AS, rather than the current updated position of the sphere.
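Just to sketch the matching idea (hypothetical, untested): a simple greedy pass on the host that assigns each new position to the closest sphere slot being replaced might look like this:

// Sketch (hypothetical): greedily match each new position to the closest sphere
// slot that is being replaced, so each updated sphere moves as little as possible.
#include <vector>
#include <cfloat>

struct P3 { float x, y, z; };

static float dist2( const P3& a, const P3& b )
{
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// oldPos: centers of the spheres chosen for replacement (as they were at AS build time).
// newPos: the freshly chosen positions.
// Returns assignment[i] = index into oldPos whose slot receives newPos[i].
std::vector<int> matchToNearestSlot( const std::vector<P3>& oldPos, const std::vector<P3>& newPos )
{
    std::vector<int>  assignment( newPos.size(), -1 );
    std::vector<bool> used( oldPos.size(), false );
    for( size_t i = 0; i < newPos.size(); ++i )
    {
        float best  = FLT_MAX;
        int   bestJ = -1;
        for( size_t j = 0; j < oldPos.size(); ++j )
        {
            if( used[j] ) continue;
            const float d = dist2( newPos[i], oldPos[j] );
            if( d < best ) { best = d; bestJ = (int)j; }
        }
        assignment[i] = bestJ;
        used[bestJ]   = true;
    }
    return assignment;
}

This greedy pass is O(k²) for k replaced spheres; a small spatial grid or a proper assignment solver would do better, and as noted above, the matching should be done against the build-time positions rather than the current refitted ones.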

FWIW, I’ve done a video for GTC demonstrating some AS update strategies for animated hair. It might give you a sense for how far things can move before the AS traversal performance starts to degrade. Link to the video in this thread: Optix Dynamic Geometry Unstable Rendering FPS - #2 by dhart


David.

Here’s a pictorial example of what I mean, just for fun. First the built AS with group bounds, and some spheres marked for replacement:

Note that there is minimal overlap: only one tiny sliver of space needs to check 2 group bounds. Then, after refitting with the marked spheres moved to random positions, even though I only picked 4 spheres to move, I end up with 100% overlap in the center of the scene, where I have to traverse all group bounds.

This is just a contrived example, but hopefully it clearly shows how moving only a small subset of your primitives a long distance can severely degrade the entire structure when using a BVH update.


David.


Sorry for taking so long to reply. Thank you very much for your detailed explanation and clear pictures! I understand the reasons for the performance degradation now, and I will try the sorting and correspondence strategy later.
