How does performance look with a huge number of rays in a very small scene?

I know that the ray tracing cores have significant performance advantages when dealing with a large number of rays in a complex scene. But I’m not sure whether using the ray tracing cores (OptiX) in the following application scenarios offers any performance advantage over simply using the CUDA cores (pure CUDA programming):

  1. The scene becomes extremely small. For instance, a simple scene consisting of only 30-40 triangles.
  2. The number of rays is several orders of magnitude greater. For example, 2^32 rays.

In this scenario, does OptiX demonstrate better performance than pure CUDA (perhaps because there are so many rays)? Compared to the typical advantageous scenarios (for example, 2^26 triangles and 2^28 rays), the scene has become much smaller while the number of rays has increased.

I am also unsure about the performance of the following scenarios.

  • What if there are numerous small scenes that need to be rendered separately? For example, 2^26 independent scenes, each containing 30 to 40 triangles. Independence means that each small scene is rendered on its own, one at a time. My thinking is that, since only one small scene is rendered at a time, creating a huge scene that encompasses all of those small scenes does not seem to offer any cost advantage.
  • Or, a single thread could emit 30 to 40 rays (i.e., a single thread makes 30 to 40 optixTrace calls). For example, 2^26 threads, each emitting 30 to 40 rays. In this scenario, could the startup overhead become very high due to the multiple calls to optixTrace? Note that the scene is very small, so the traversal itself might be quite fast.

I am very interested in the performance of the above application scenarios. Could you perhaps offer some suggestions?

Hi @hkkzzxz24,

This is an interesting question!

I don’t know the answer with any certainty, as it depends on your application’s specifics, but my guess/instinct is that OptiX should be able to help with both cases, with some assumptions and caveats. It’s true that OptiX is designed for large scenes and there are some overheads and design choices that were made for film production renderers that don’t necessarily represent the best you can do for small scenes or experimental rendering techniques.

For case 1 (small scenes), the triangle intersections are still hardware accelerated. The answer to this question depends on what kind of CUDA code you’re willing to write when not using OptiX. If the scene is small enough and simple enough, it’s plausible that you could save time by not using a BVH at all, and write the CUDA code in such a way that you hard-code “traversal” into your rendering algorithm - essentially hand-code your acceleration structure. People do tricks like this in ShaderToy shaders, for example. In that case, you might be able to make the CUDA code run faster simply by removing the BVH build from your workflow. That could be a very time-consuming rabbit hole for the developer, though, and it comes with serious limitations on what kinds of scenes you can handle – so it’s possible, but it doesn’t scale.
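
To make that concrete, here is a rough sketch of what “no BVH at all” can look like for a scene of a few dozen triangles - plain CUDA, no OptiX, and all names (Tri, intersect, traceAll) are hypothetical:

```cpp
// Hypothetical sketch: brute-force ray/triangle intersection in plain CUDA,
// with no acceleration structure at all. Only reasonable when the triangle
// count is tiny (e.g., 30-40).
#include <cuda_runtime.h>

struct Tri { float3 v0, v1, v2; };

__device__ float3 sub(float3 a, float3 b)    { return make_float3(a.x - b.x, a.y - b.y, a.z - b.z); }
__device__ float3 cross3(float3 a, float3 b) { return make_float3(a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x); }
__device__ float  dot3(float3 a, float3 b)   { return a.x*b.x + a.y*b.y + a.z*b.z; }

// Moller-Trumbore intersection; returns the hit distance t, or -1.0f on a miss.
__device__ float intersect(float3 orig, float3 dir, const Tri& tri)
{
    const float3 e1  = sub(tri.v1, tri.v0);
    const float3 e2  = sub(tri.v2, tri.v0);
    const float3 p   = cross3(dir, e2);
    const float  det = dot3(e1, p);
    if (fabsf(det) < 1e-8f) return -1.0f;
    const float  inv = 1.0f / det;
    const float3 s   = sub(orig, tri.v0);
    const float  u   = dot3(s, p) * inv;
    if (u < 0.0f || u > 1.0f) return -1.0f;
    const float3 q   = cross3(s, e1);
    const float  v   = dot3(dir, q) * inv;
    if (v < 0.0f || u + v > 1.0f) return -1.0f;
    const float  t   = dot3(e2, q) * inv;
    return t > 0.0f ? t : -1.0f;
}

// One thread per ray: test the ray against every triangle in the tiny scene.
__global__ void traceAll(const Tri* tris, int numTris,
                         const float3* origins, const float3* dirs,
                         float* hitT, int numRays)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numRays) return;

    float closest = 1e30f;
    for (int k = 0; k < numTris; ++k) {                  // 30-40 iterations: cheap
        const float t = intersect(origins[i], dirs[i], tris[k]);
        if (t > 0.0f && t < closest) closest = t;
    }
    hitT[i] = (closest < 1e30f) ? closest : -1.0f;
}
```

With only 30-40 triangles the inner loop is short and fully coherent across the warp, which is exactly the situation where skipping the BVH build (and its memory traffic) can pay off - and exactly the thing that stops working once the triangle count grows.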

Also for case 1, it depends on what kind of shading you need. If you have only 1 shader or no shaders, and shading is very simple, the hardware acceleration is almost guaranteed to be faster than a CUDA renderer, even with very tiny scenes.

With case 2 (many rays) - I would think this tends to favor OptiX because time spent ray tracing will dominate over small overheads. However, this may depend on what you do for case 1. If you found a way to make ray tracing faster for 1 ray, then it’s possible you could scale up and make ray tracing faster for many rays.

For what it’s worth, we have spoken with some AI researchers who asked similar questions, especially with very large numbers of small scenes, and sometimes the limiting factor is the BVH builds, and not the rendering at all. Creating a huge scene to encompass all the small scenes might not seem like a cost advantage when rendering, but you will coalesce all the overheads of launching many small BVH builds into a single BVH build and a single launch, and it’s likely to save a lot of time for that reason. If you can take advantage of the new cluster API, then you might be able to save even more time that way.
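
As a rough host-side sketch of coalescing the builds (this assumes you have already translated each small scene into its own disjoint region of space and concatenated all vertices into one device buffer; buildMergedGAS, d_allVertices, and totalVertexCount are hypothetical names, not a prescribed API):

```cpp
// Hypothetical sketch: build ONE acceleration structure over the concatenated
// triangles of all the small scenes, instead of millions of tiny builds.
// Assumes OptiX is already initialized and d_allVertices holds every small
// scene's float3 vertices, offset into disjoint regions of space.
#include <cuda_runtime.h>
#include <optix.h>
#include <optix_stubs.h>

OptixTraversableHandle buildMergedGAS(OptixDeviceContext context, CUstream stream,
                                      CUdeviceptr d_allVertices, unsigned int totalVertexCount)
{
    OptixAccelBuildOptions accelOptions = {};
    accelOptions.buildFlags = OPTIX_BUILD_FLAG_PREFER_FAST_TRACE;
    accelOptions.operation  = OPTIX_BUILD_OPERATION_BUILD;

    OptixBuildInput buildInput = {};
    buildInput.type = OPTIX_BUILD_INPUT_TYPE_TRIANGLES;
    buildInput.triangleArray.vertexFormat  = OPTIX_VERTEX_FORMAT_FLOAT3;
    buildInput.triangleArray.numVertices   = totalVertexCount;   // summed over all small scenes
    buildInput.triangleArray.vertexBuffers = &d_allVertices;     // one big concatenated buffer

    unsigned int triFlags = OPTIX_GEOMETRY_FLAG_DISABLE_ANYHIT;  // see the anyhit note below
    buildInput.triangleArray.flags         = &triFlags;
    buildInput.triangleArray.numSbtRecords = 1;

    OptixAccelBufferSizes sizes = {};
    optixAccelComputeMemoryUsage(context, &accelOptions, &buildInput, 1, &sizes);

    CUdeviceptr d_temp = 0, d_output = 0;
    cudaMalloc(reinterpret_cast<void**>(&d_temp),   sizes.tempSizeInBytes);
    cudaMalloc(reinterpret_cast<void**>(&d_output), sizes.outputSizeInBytes);

    OptixTraversableHandle handle = 0;
    optixAccelBuild(context, stream, &accelOptions, &buildInput, 1,
                    d_temp,   sizes.tempSizeInBytes,
                    d_output, sizes.outputSizeInBytes,
                    &handle, nullptr, 0);
    // (error checking, freeing d_temp after the build, and compaction omitted)
    return handle;
}
```

Inside the hit programs, each ray can then recover which small scene it hit from the primitive index (or from the instance ID, if you instead build one GAS per scene under a single IAS).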

I don’t understand the final question/scenario about threads and rays, but I’ll ramble a little more and you can tell me if I’m not answering your question, okay? There’s no real startup overhead to tracing a ray or to starting a thread, really, other than the nanosecond-level overheads of using the RT cores. It’s fine to cast 1 ray per thread or to cast 40 or even thousands of rays per thread. The main thing you need to do performance-wise is to cast the same number of rays for each thread, at least within a warp. If you cast more rays in some threads, then the threads with fewer rays will need to wait for the threads in the warp that have more rays to finish, and that introduces code divergence in the warp, and leads to inefficiency.
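
In code, “same number of rays per thread” just means the raygen loop has a uniform trip count. A minimal sketch (the Params layout and names are hypothetical, not from your application):

```cpp
// Hypothetical raygen sketch: every thread traces exactly RAYS_PER_THREAD rays,
// so all threads in a warp stay in lockstep through the loop.
#include <optix.h>

struct Params
{
    OptixTraversableHandle handle;
    float3* origins;      // RAYS_PER_THREAD entries per launch index
    float3* directions;
    float*  hitT;         // output: hit distance per ray, or -1 on miss
};
extern "C" __constant__ Params params;

constexpr unsigned int RAYS_PER_THREAD = 32;  // identical for every thread

extern "C" __global__ void __raygen__manyRaysPerThread()
{
    const unsigned int tid  = optixGetLaunchIndex().x;
    const unsigned int base = tid * RAYS_PER_THREAD;

    for (unsigned int i = 0; i < RAYS_PER_THREAD; ++i)    // uniform trip count
    {
        unsigned int p0 = __float_as_uint(-1.0f);         // payload: hit distance
        optixTrace(params.handle,
                   params.origins[base + i], params.directions[base + i],
                   0.0f, 1e16f, 0.0f,                     // tmin, tmax, time
                   OptixVisibilityMask(255),
                   OPTIX_RAY_FLAG_DISABLE_ANYHIT,
                   0, 1, 0,                               // SBT offset, stride, miss index
                   p0);
        params.hitT[base + i] = __uint_as_float(p0);
    }
}
```

The matching closesthit program would write the hit distance back into the payload with optixSetPayload_0().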

The other thing to ensure is that when using RT cores, the work for a ray stays on the RT core until it hits its final destination. By this I mean, for highest performance and lowest overheads, avoid the use of custom intersection programs and anyhit programs. Use the built-in hardware triangles, and make sure to disable anyhit explicitly. Use the terminate-on-first-hit flag when you can, e.g., with shadow rays. If you have shaders that need to run during a ray’s traversal, that’s when RT core overheads can stack up.
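
For a shadow/visibility ray, for example, that boils down to something like this fragment in raygen or your shading code (shadowOrigin, dirToLight, and distToLight are placeholder names):

```cpp
// Sketch: a visibility ray that stays on the RT core until it is done.
// Anyhit is disabled, traversal stops at the first hit, and closesthit is
// skipped; only the miss program runs (setting the payload to 1 = visible).
unsigned int visible = 0;
optixTrace(params.handle,
           shadowOrigin, dirToLight,
           1e-3f, distToLight, 0.0f,                 // tmin, tmax, time
           OptixVisibilityMask(255),
           OPTIX_RAY_FLAG_DISABLE_ANYHIT
             | OPTIX_RAY_FLAG_TERMINATE_ON_FIRST_HIT
             | OPTIX_RAY_FLAG_DISABLE_CLOSESTHIT,
           0, 1, 0,                                  // SBT offset, stride, miss index
           visible);
```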

You should also look at the SER API in OptiX. For simple scenes, you can call optixTraverse instead of optixTrace, and avoid having any shaders called at all. And if you have any divergence issues, you can consider using optixReorder to iron them out. It has its own overhead to balance, but for people who have had bad divergence problems, reorder has in some cases improved performance by multiples.
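
Roughly, that pattern looks like the sketch below: traverse without invoking any shaders, inspect the hit object directly, and optionally reorder before any divergent shading (everything except the OptiX calls themselves is a placeholder):

```cpp
// Sketch: OptiX 8 SER-style traversal. optixTraverse takes the same arguments
// as optixTrace but does not invoke closesthit/miss; the result is left in the
// thread's hit object, which can be queried directly in raygen.
unsigned int p0 = 0;                                // payload register
optixTraverse(params.handle,
              rayOrigin, rayDirection,
              0.0f, 1e16f, 0.0f,                    // tmin, tmax, time
              OptixVisibilityMask(255),
              OPTIX_RAY_FLAG_DISABLE_ANYHIT,
              0, 1, 0,                              // SBT offset, stride, miss index
              p0);

if (optixHitObjectIsHit())
{
    // Read what we need straight from the hit object - no closesthit call.
    const unsigned int primIdx = optixHitObjectGetPrimitiveIndex();
    const float        hitT    = optixHitObjectGetRayTmax();
    // ... simple, inline shading here ...
}

// If shading is divergent, optixReorder() can regroup threads by hit state
// before optixInvoke() runs the closesthit/miss program for the hit object:
// optixReorder();
// optixInvoke(p0);
```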

Of course, if there are things we can do to improve performance, we are interested to hear about them.


David.


Thank you, dhart.

I see. That sounds like good news for me.

Yes, I am more than willing to clarify.
As you mentioned, I now understand that optixTrace itself does not have a large startup overhead.

The concern I have is that, when multiple rays are emitted within a single thread (i.e., multiple calls to optixTrace are made, such as 30-40 times), the execution flow actually bounces back and forth between the RT cores and the CUDA cores. Will this affect performance? My worry is that these jumps could result in poorer performance than running purely on the CUDA cores, because a pure CUDA implementation does not have the hidden overhead of these jumps.

To put it concretely:

  • optixTrace hands the ray to the RT cores for traversal.
  • After traversal completes, execution returns to the CUDA cores to run the closest-hit shader.
  • Control then returns to raygen, i.e., back to the CUDA cores.
  • In raygen, the next optixTrace is called, execution goes back to the RT cores, and the process repeats 30 to 40 times.

It can be seen that there is a lot of bouncing between the RT cores and the CUDA cores in this program, and this is precisely what the best-practice guidance recommends avoiding as much as possible (e.g., disable anyhit if possible). My concern stems from this:

Will the hidden overhead caused by the repeated jumps between the RT cores and the CUDA cores nullify the performance improvement from using the RT cores, or even make performance worse than using the CUDA cores alone?

To restate my application scenario:

  • Many small scenes of 30-40 triangles each: this implies the traversal cost per ray is lower than in a conventional scene.
  • One thread emits a large number of rays: the total number of optixTrace calls is greater than in the conventional approach. For example, with 2^26 threads, a regular setup of 1 ray per thread emits 2^26 rays in total. In my scenario, one thread emits 30 to 40, or even 100, rays, for a total of roughly 2^26 * 30 ≈ 2^26 * 2^5 = 2^31 rays.

When using the RT cores, there is no additional overhead of tracing multiple rays per thread compared to tracing one ray per thread (and potentially using more threads). So tracing 30-40 rays per thread is perfectly fine, as long as all threads in the warp do the same computation and trace the same number of rays.

The concern about additional overheads that I mentioned is when doing multiple round-trips to the RT cores per ray. The problem with custom intersection programs and anyhit shaders is that they are potentially called many times per ray, sometimes hundreds or thousands of times per ray in large scenes. This is why the tiny overhead of going to the RT cores can add up to a lot. The overhead of a single round trip for a single ray is quite small.


David.


Appreciate it.
