Any lower access to RT core than OptiX?

Hi guys, I would like to know if there is any way to access the RT cores’ ray tracing capability at a lower level than the OptiX API.
I’m asking because I would like to do hybrid ray tracing which utilizes both the CPU and the GPU. One possibility is to render only a single tile during each optixLaunch and then dispatch the tiles to both the CPU and the GPU (roughly as sketched below). But I don’t know whether using OptiX this way causes any performance problems, or whether OptiX is even meant to be used this way.
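For the first possibility, something like this per-tile launch loop is what I have in mind (just a rough sketch; the LaunchParams layout, the tile size, and the buffer handling are made up):

```cpp
#include <optix.h>
#include <optix_stubs.h>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical launch parameters; one optixLaunch renders exactly one tile.
struct LaunchParams
{
    uint2   tileOrigin;    // top-left pixel of this tile
    uint2   tileSize;      // tile dimensions in pixels
    float4* outputBuffer;  // full-resolution output buffer
    // ... camera, traversable handle, etc.
};

void renderGpuTiles(const std::vector<uint2>& gpuTiles, float4* d_output,
                    OptixPipeline pipeline, const OptixShaderBindingTable& sbt,
                    CUdeviceptr d_params, CUstream stream)
{
    for (const uint2& origin : gpuTiles)
    {
        LaunchParams params = {};
        params.tileOrigin   = origin;
        params.tileSize.x   = 64;   // example tile size
        params.tileSize.y   = 64;
        params.outputBuffer = d_output;

        // For pageable host memory this copy is staged before returning,
        // so reusing the local struct in the next iteration is fine here.
        cudaMemcpyAsync(reinterpret_cast<void*>(d_params), &params,
                        sizeof(LaunchParams), cudaMemcpyHostToDevice, stream);

        // Launch dimensions equal the tile size; the raygen program offsets
        // its pixel coordinates by tileOrigin.
        optixLaunch(pipeline, stream, d_params, sizeof(LaunchParams),
                    &sbt, params.tileSize.x, params.tileSize.y, 1);
    }
    cudaStreamSynchronize(stream); // the CPU can render its own tiles before this point
}
```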
Another possibility is to write the whole BVH build and intersection test in CUDA, which would give me full control. That is certainly viable, but would I still be able to access the RT cores in that case?

You can only access the RT cores implicitly through the three ray tracing APIs: OptiX, DXR, and Vulkan Raytracing extensions.
There is no way to program the RT cores directly because they have changed with every GPU generation and exposing their instructions would have limited these advancements.

The DXR and Vulkan APIs offer two levels of ray tracing functionality: one using the full pipeline model like OptiX, and one using explicit ray queries. That’s as low-level as it gets.
Links to the respective extensions for Vulkan in this post: https://forums.developer.nvidia.com/t/what-are-the-advantages-and-differences-between-optix-7-and-vulkan-raytracing-api/220360

There are also different possible approaches to the renderer implementation. Some use the ray tracing APIs (and thereby the RT cores on RTX boards) only for the BVH traversal and ray-primitive intersection in a ray wavefront approach, but do all ray generation and shading calculations in native CUDA kernels between the launches, for performance and algorithmic freedom. That requires intricate asynchronous programming and optimal memory accesses (see the sketch below).
On the other hand, the OptiX SDK 8.0.0 release just added support for Shader Execution Reordering on Ada Lovelace GPUs, which can greatly improve performance in renderers with a standard ray tracing pipeline with only small changes.
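To make the wavefront idea above a bit more concrete, here is a minimal host-side sketch. The kernel, the buffer handling, and the compaction step are assumptions for illustration, not part of the OptiX API:

```cpp
#include <optix.h>
#include <optix_stubs.h>
#include <cuda_runtime.h>

// Hypothetical native CUDA kernel: shades the hits of the current wavefront
// and writes the rays for the next bounce back into the ray buffer.
__global__ void shadeAndGenerateNextRays(unsigned int numRays)
{
    // ... read hit records, evaluate materials, emit next-bounce rays ...
}

// One frame of a wavefront-style renderer. The OptiX raygen program only
// traces the rays stored in the launch parameters and records the hits;
// all shading happens in the CUDA kernel between the launches.
void renderWavefrontFrame(OptixPipeline pipeline, const OptixShaderBindingTable& sbt,
                          CUdeviceptr d_params, size_t paramsSize,
                          CUstream stream, unsigned int numRays, int maxBounces)
{
    for (int bounce = 0; bounce < maxBounces && numRays > 0; ++bounce)
    {
        // Trace the current wavefront, one launch index per active ray.
        optixLaunch(pipeline, stream, d_params, paramsSize, &sbt, numRays, 1, 1);

        // Shade and generate the next wavefront in a native CUDA kernel.
        const unsigned int block = 256;
        shadeAndGenerateNextRays<<<(numRays + block - 1) / block, block, 0, stream>>>(numRays);

        // A real renderer would compact the surviving rays here (e.g. with CUB)
        // and update numRays for the next iteration.
    }
    cudaStreamSynchronize(stream);
}
```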

There shouldn’t be a problem running CPU and GPU calculations in parallel.
But if you’re mixing different ray tracing implementations, you will hardly be able to generate exactly the same images. That means it’s not possible to get your CPU and GPU images pixel-exact, simply due to the different intersection routines, and that can become visible when rendering in tiles, and especially when rendering animations.
Hybrid methods are therefore usually either not done in tiles but as sub-frames of a progressive Monte Carlo algorithm, or implemented as a full CPU fallback for when no suitable GPU hardware is available.
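Just to illustrate the structure of running both in parallel (the per-tile render functions here are hypothetical placeholders): enqueue the GPU tiles asynchronously on a CUDA stream and let host threads render their share at the same time, then synchronize both before assembling the frame.

```cpp
#include <cuda_runtime.h>
#include <thread>
#include <vector>

// Hypothetical placeholders: renderTileGPU would enqueue the launch parameter
// update and the optixLaunch for one tile on the given stream; renderTileCPU
// would run the CPU ray tracer for one tile.
void renderTileGPU(int tile, cudaStream_t stream) { /* cudaMemcpyAsync + optixLaunch */ }
void renderTileCPU(int tile)                      { /* CPU ray tracing for this tile */ }

void renderHybridFrame(const std::vector<int>& gpuTiles,
                       const std::vector<int>& cpuTiles,
                       cudaStream_t stream)
{
    // Enqueue all GPU tiles; optixLaunch is asynchronous and does not block the host.
    for (int tile : gpuTiles)
        renderTileGPU(tile, stream);

    // Render the CPU tiles concurrently. (A real application would use a
    // thread pool or task scheduler instead of one thread per tile.)
    std::vector<std::thread> workers;
    for (int tile : cpuTiles)
        workers.emplace_back(renderTileCPU, tile);

    for (std::thread& worker : workers)
        worker.join();

    cudaStreamSynchronize(stream); // wait for the GPU tiles as well
}
```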

Note that the BVH traversal and ray-triangle intersection performance of dedicated hardware implementations will surpass CPU implementations by orders of magnitude.

If your intention is to learn how to do that, go for it. That is a very interesting topic and definitely worth learning, especially if you plan to work on this professionally. But if your intention is to implement your own renderer, learning how to use the existing APIs optimally would provide much quicker results.

Don’t underestimate the amount of developer years which have gone into implementing each of these ray tracing APIs and the underlying functionality. Just read through the OptiX Programming Guide and think about how you would implement the described features yourself.

Thank you for your detailed and professional answer.
I believe the explicit ray query is the answer I was looking for; I’ll look into Vulkan ray tracing for more detailed information.
And since you’ve mentioned the consistency of CPU- and GPU-rendered pixels, I’ve got another question about that. I’m new to OptiX, and what I’ve learned is that the user has full control over the ray generation and integrator calculations; for the shading part I’m using OSL (which will have PTX callable support for the GPU part). What else could possibly cause differences between CPU- and GPU-rendered pixels? Sampling of random numbers? Precision of the intersection results?

I believe the explicit ray query is the answer I was looking for; I’ll look into Vulkan ray tracing for more detailed information.

I actually wouldn’t use low-level ray queries. They aren’t compatible with Shader Execution Reordering.

If you intended to go the CUDA route, I assume you’re familiar with the CUDA programming model, and then OptiX would be a natural fit for an application using GPU ray tracing, especially if you’re looking at an OSL-to-PTX workflow.
Programming shaders in CUDA C++ offers quite a bit more code flexibility than GLSL/SPIR-V.
Also, as the OptiX/Vulkan comparison post I linked to explains, OptiX offers a number of additional features and works on more hardware configurations.
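As a rough idea of how per-material shading code, whether hand-written CUDA or OSL-generated PTX, can be hooked in on the OptiX side, here is a hedged sketch using OptiX direct callables; the HitData struct, the callable index plumbing, and the shading math are made-up examples:

```cpp
#include <optix.h>

// Hypothetical SBT hit record data carrying the material's callable index.
struct HitData
{
    unsigned int materialCallableIndex;
    float3       albedo;
};

// Hypothetical per-material shading function compiled to PTX as an OptiX
// direct callable (the __direct_callable__ prefix is the OptiX naming convention).
extern "C" __device__ float3 __direct_callable__eval_diffuse(float3 albedo, float3 wi, float3 n)
{
    const float cosTheta = fmaxf(wi.x * n.x + wi.y * n.y + wi.z * n.z, 0.0f);
    const float k = cosTheta * (1.0f / 3.14159265f); // Lambert: albedo / pi * cos(theta)
    return make_float3(albedo.x * k, albedo.y * k, albedo.z * k);
}

// Inside a closest-hit program, the material's shading function is selected
// at runtime via its callable SBT index.
extern "C" __global__ void __closesthit__shade()
{
    const HitData* hit = reinterpret_cast<const HitData*>(optixGetSbtDataPointer());

    const float3 wi = make_float3(0.0f, 1.0f, 0.0f); // placeholder light direction
    const float3 n  = make_float3(0.0f, 1.0f, 0.0f); // placeholder shading normal

    const float3 radiance = optixDirectCall<float3, float3, float3, float3>(
        hit->materialCallableIndex, hit->albedo, wi, n);

    optixSetPayload_0(__float_as_uint(radiance.x));
    optixSetPayload_1(__float_as_uint(radiance.y));
    optixSetPayload_2(__float_as_uint(radiance.z));
}
```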

What else could possibly cause differences between CPU- and GPU-rendered pixels? Sampling of random numbers? Precision of the intersection results?

I was mostly concerned about the intersection routines. If there is any difference between the implementations in when geometry edges are hit or missed, that is, which adjacent triangle is hit on a shared edge, there might be alignment problems at tile borders. Note that the OptiX ray-triangle intersection is watertight.
This gets even more complicated for the built-in curve primitives in OptiX; some of the intersection routines for those were invented at NVIDIA.

OptiX uses floating-point ray and geometry data. If there is any difference in how the actual transformations and intersections are calculated, for example fused multiply-add instructions versus individual multiplications and additions, that alone already results in different precision, including in the intersection distance.
Just have a look at the self-intersection avoidance functions inside the OptiX Toolkit Shader Utilities, which are explicitly written with CUDA math intrinsics to prevent compiler optimizations that would introduce ULP errors.
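As a generic illustration of the fused multiply-add point (this is not code from the OptiX Toolkit): the kernel below evaluates a*b + c once with a single rounding and once with two roundings via CUDA math intrinsics, and the results already differ in the last bits, which is exactly the kind of divergence that also shows up in intersection distances.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// a*b + c computed with one rounding (FMA) versus two roundings (mul, then add).
__global__ void fmaVsMulAdd(float a, float b, float c, float* out)
{
    out[0] = __fmaf_rn(a, b, c);             // fused: rounds once
    out[1] = __fadd_rn(__fmul_rn(a, b), c);  // unfused: rounds twice
}

int main()
{
    float* d_out = nullptr;
    cudaMalloc(&d_out, 2 * sizeof(float));

    // 1/3 is not exactly representable, so the two code paths diverge.
    fmaVsMulAdd<<<1, 1>>>(1.0f / 3.0f, 3.0f, -1.0f, d_out);

    float h_out[2] = {};
    cudaMemcpy(h_out, d_out, 2 * sizeof(float), cudaMemcpyDeviceToHost);
    printf("fma: %g  mul+add: %g\n", h_out[0], h_out[1]); // e.g. ~3e-08 vs 0
    cudaFree(d_out);
    return 0;
}
```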

Random number sampling shouldn’t be a problem when the sampler is based on integer sequences.
Simple things like additive recurrence algorithms with irrational numbers in floating point could diverge.
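For illustration only (neither of these is what OptiX or any particular renderer prescribes): an integer-based sampler like the widely used PCG hash produces bit-identical values on CPU and GPU because 32-bit integer arithmetic is exact on both, while a floating-point additive recurrence with an irrational constant can drift once the two sides round or order the operations differently.

```cpp
#include <cstdint>
#include <cuda_runtime.h>

// Integer-based sampler (PCG-style hash): identical results on CPU and GPU.
__host__ __device__ inline float pcgRandom(uint32_t& state)
{
    state = state * 747796405u + 2891336453u;
    uint32_t word = ((state >> ((state >> 28u) + 4u)) ^ state) * 277803737u;
    word = (word >> 22u) ^ word;
    return static_cast<float>(word >> 8) * (1.0f / 16777216.0f); // [0, 1)
}

// Additive recurrence with the golden ratio conjugate, kept in floating point:
// the repeated addition and fractional-part extraction can accumulate rounding
// differences between implementations, so CPU and GPU sequences may slowly diverge.
__host__ __device__ inline float additiveRecurrence(float& x)
{
    x += 0.61803398875f;                           // irrational constant, already rounded
    x -= static_cast<float>(static_cast<int>(x));  // keep the fractional part
    return x;
}
```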

Thank you very much!

I need to do some actual programming to test out the ideas now.