How much performance improvement can hardware accelerated ray tracing provide?

Hello! I implemented a ray tracing task using OptiX 5.1, and the execution time on a TITAN X (Pascal) was on the order of 10^2. I then implemented the same ray tracing task with OptiX 7.1, and the execution time on an RTX 4090 was on the order of 10^(-1). I wonder whether such a large performance improvement (1000x) is plausible, or whether something is wrong with my implementation? Thanks in advance!

Hi @thiltuiv,

Generally speaking, that seems like a larger ratio than I would expect from a fair, best-effort comparison on both sides. The ratio should absolutely be larger than one order of magnitude. Two orders of magnitude is plausible, depending on what you’re doing, though I might expect the result to land somewhere between 10x and 100x. We would need to discuss what you’ve done in much more detail to answer the question or be sure of anything. 1000x sounds less likely, and more like something could be wrong on your end. That said, 1000x is not out of the question, and it will depend on all kinds of things, including which ray tracing features you’re using, which SM features you’re using, how much of each type of memory you’re using, what the bottlenecks are in each implementation, how exactly you’re timing your results, etc.

Of course, I must reiterate that many, many things have changed between OptiX 5.1 and OptiX 7.1, so this can never be an apples-to-apples comparison; even if the perf ratio is plausible, it will be very difficult to account for all the differences. And to some degree it’s becoming moot, since current OptiX offers algorithmic ways to gain performance that OptiX 5 did not have, such as hardware accelerated SER and payload semantics, in addition to hardware accelerated traversal & intersection.

May I ask what the goal of this comparison is? Maybe there are ways we can help you achieve your goals and save time, instead of trying to implement and understand how a very old API on old hardware performs. Even though you have an initial working implementation, it could still be a very long and difficult process to find a satisfactory answer with high enough confidence. For what it’s worth, this sounds like a difficult question to answer even with our internal knowledge and tools.

–
David.


Thank you very much for your reply. Our work is here: [2412.09337] RTCUDB: Building Databases with RT Processors. We made this comparison in response to a reviewer who questioned whether our performance gains actually come from using RT cores. By the way, our implementation is based on the simple samples provided by OptiX (optixHello, optixTriangle), without further optimization.

Very interesting work!

Indeed, it’s going to be difficult to prove the RT cores are responsible without a way to change only one thing at a time, and that doesn’t currently exist. However, traversal and triangle intersection are faster with RT cores than in software (otherwise we would not build RT cores). Some of your performance gains must come from RT cores, so the question is perhaps not whether hardware is accelerating your work, but by how much. (I guess that is exactly the question you’re asking here.) One way to frame your argument is that your use of ray tracing for database queries is a novel algorithmic improvement on its own, one that may accelerate queries because your geometric interpretation leads to improved data access patterns, and that can be further accelerated by ray tracing hardware. Perhaps this framing gives your paper and your algorithm more of the credit than the claim that all of the acceleration is due to specialized hardware. ;)

One approach to consider is to write your own CUDA ray tracing & traversal kernel and compare it to Crystal and to OptiX. This can help answer how much of your algorithm’s speedup is due solely to reframing the database query problem as a ray tracing problem. It could be a slightly unfair comparison to OptiX & RT cores, because a small, well-tuned, fixed-function traversal kernel may in some cases outperform the software version of OptiX, since OptiX has a more complex programming model and all kinds of production-level features that you don’t need.
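If it helps, here is a rough sketch of the kind of software baseline you could start from: a plain CUDA kernel doing brute-force Möller–Trumbore ray/triangle intersection, one thread per ray, with no BVH and no OptiX involved. All of the names here (Ray, Tri, bruteForceIntersect) are made up for this sketch, and a real comparison would of course add your own acceleration structure and traversal on top of it.

```cpp
// Minimal software-baseline sketch: brute-force Möller–Trumbore in plain CUDA.
// One thread per ray, no BVH, no OptiX. Names are hypothetical placeholders.
#include <cuda_runtime.h>

struct Ray { float3 o, d; };
struct Tri { float3 v0, v1, v2; };

__device__ float3 sub3 (float3 a, float3 b) { return make_float3(a.x - b.x, a.y - b.y, a.z - b.z); }
__device__ float3 cross3(float3 a, float3 b) { return make_float3(a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x); }
__device__ float  dot3  (float3 a, float3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

// Writes the nearest hit distance for each ray against all triangles,
// or a large sentinel value if nothing is hit.
__global__ void bruteForceIntersect(const Ray* rays, int numRays,
                                    const Tri* tris, int numTris,
                                    float* hitT)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numRays) return;

    Ray r = rays[i];
    float tNearest = 1e30f;

    for (int j = 0; j < numTris; ++j)
    {
        // Möller–Trumbore ray/triangle intersection test
        float3 e1 = sub3(tris[j].v1, tris[j].v0);
        float3 e2 = sub3(tris[j].v2, tris[j].v0);
        float3 p  = cross3(r.d, e2);
        float det = dot3(e1, p);
        if (fabsf(det) < 1e-8f) continue;            // ray parallel to triangle plane
        float invDet = 1.0f / det;
        float3 s = sub3(r.o, tris[j].v0);
        float u  = dot3(s, p) * invDet;
        if (u < 0.0f || u > 1.0f) continue;
        float3 q = cross3(s, e1);
        float v  = dot3(r.d, q) * invDet;
        if (v < 0.0f || u + v > 1.0f) continue;
        float t  = dot3(e2, q) * invDet;
        if (t > 0.0f && t < tNearest) tNearest = t;  // keep the closest hit
    }
    hitT[i] = tNearest;
}
```

Launched with something like `bruteForceIntersect<<<(numRays + 255) / 256, 256>>>(...)`, this gives you a pure-CUDA number to put next to the OptiX numbers, even before you add any traversal structure.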

Okay, that said: I’d guess that optixHello & optixTriangle are far too simple to use for comparing perf, since neither of them will saturate the GPU. How exactly are you measuring the performance of these samples in each case (OptiX 5 vs OptiX 7)?

Maybe try getting an estimate of rays per second using the optixPathTracer sample. You’ll need to estimate the ray count per frame (without slowing the render down, as discussed in a recent thread), and ensure the same render algorithm and settings (path depth, samples per pixel, resolution, etc.) on both sides. Make sure you’re comparing Release builds only, and that any debug features are disabled or compiled out. Rays per second from optixPathTracer will probably give you a better ballpark estimate than you can ever get with optixHello or optixTriangle. Lock your graphics clocks, start your measurements after several frames have rendered, and maybe average the results over 100 frames. All the caveats I’ve mentioned still apply: the resulting perf ratio still won’t implicate or prove anything about software vs hardware ray tracing, because too many things changed between OptiX 5 and OptiX 7, but I do think the ratio will be more plausible and a bit lower than 1000x.
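As a sketch of the timing side of that, and assuming a hypothetical renderOneFrame() wrapper around your own optixLaunch call on the default stream, you could bracket a batch of steady-state frames with CUDA events and divide by your estimated rays per frame:

```cpp
// Hedged timing sketch: skip warm-up frames, time a batch of launches with
// CUDA events, and report an average frame time plus a rays-per-second estimate.
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical stand-in: replace with your real per-frame launch
// (one optixLaunch on the default stream, plus any per-frame buffer updates).
void renderOneFrame() {}

void measureRaysPerSecond(int warmupFrames, int timedFrames, double raysPerFrame)
{
    // Warm up first so one-time costs (compilation, allocations, caches)
    // don't end up inside the timed region.
    for (int i = 0; i < warmupFrames; ++i)
        renderOneFrame();
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < timedFrames; ++i)
        renderOneFrame();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // total GPU time in milliseconds

    double msPerFrame = ms / timedFrames;
    double raysPerSec = raysPerFrame / (msPerFrame * 1e-3);
    printf("avg frame: %.3f ms, ~%.3g rays/s\n", msPerFrame, raysPerSec);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}

int main()
{
    // Placeholder estimate: width * height * samples per pixel * average path length.
    double raysPerFrame = 1920.0 * 1080.0 * 16.0 * 4.0;
    measureRaysPerSecond(/*warmupFrames=*/10, /*timedFrames=*/100, raysPerFrame);
}
```

The raysPerFrame value in main() is just a placeholder; the point of the warm-up loop is that one-time costs don’t leak into the average you end up comparing.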

–
David.


I am sorry that my statement was not clear enough and caused a misunderstanding. I am not using optixHello & optixTriangle to compare performance. What I mean is that my own work imitates those simple samples in how it creates the acceleration structures, modules, program groups, pipelines, etc.

Ah, got it, thank you for clarifying. I think the same advice above applies to profiling your own implementation too. If you still want to compare OptiX 5 + Titan X to OptiX 7 + 4090, make sure the launches are large enough to saturate the GPU, and make sure your perf measurement is either narrow enough in scope to accurately capture the launch kernel timings, or broad enough (end-to-end) to measure frame rate and estimate rays per second.

My suspicion is that your 1000x comes from measuring very different things in the OptiX 5 and OptiX 7 implementations. With OptiX 5, many things are going on under the hood, like memory allocations, SBT and scene management, and these will be hard to isolate and account for in your timing. With OptiX 7, since allocations and management of the SBT and the scene are all under your control, you might be including less of the setup and host-side work in your timings than with the OptiX 5 path. Measuring performance starting after the first frame, and looking at multiple frame render times for a static scene, will give you a more realistic comparison.
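A minimal sketch of that kind of bracketing, with buildScene() and renderOneFrame() as hypothetical stand-ins for your own OptiX 7 setup and launch code (the OptiX 5 path would need equivalent brackets around its implicit work), might look like this:

```cpp
// Hedged sketch: bracket setup, the first frame, and steady-state frames
// separately, so both API versions are measured the same way.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-ins; replace the bodies with your real code.
void buildScene()     { /* context, module, pipeline, SBT, accel build */ }
void renderOneFrame() { /* one launch for a static scene */ }

int main()
{
    using clock = std::chrono::steady_clock;

    auto t0 = clock::now();
    buildScene();
    cudaDeviceSynchronize();          // include async accel-build work in "setup"
    auto t1 = clock::now();

    renderOneFrame();                 // first frame: may still hit one-time costs
    cudaDeviceSynchronize();
    auto t2 = clock::now();

    const int frames = 100;
    for (int i = 0; i < frames; ++i)  // steady state: what you actually compare
        renderOneFrame();
    cudaDeviceSynchronize();
    auto t3 = clock::now();

    auto ms = [](auto a, auto b) {
        return std::chrono::duration<double, std::milli>(b - a).count();
    };
    printf("setup: %.2f ms, first frame: %.2f ms, steady-state frame: %.2f ms\n",
           ms(t0, t1), ms(t1, t2), ms(t2, t3) / frames);
}
```

If the first-frame number is much larger than the steady-state number, some one-time cost is probably ending up in whichever of your measurements happens to include it.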

I’m only guessing here based on very limited info. We’re happy to look at the profiling methodology in more detail if you’d like. Because measuring performance is easier and more controllable with OptiX 7, it will also be easier to compare against your own CUDA ray tracing kernel if you decide to go that route.

–
David.
