OptiX: low computational usage on GPU

Hi,

I’m working with OptiX 6.5 to run ray tracing simulations using the Opticks repository (Opticks : GPU Accelerated Optical Photon Simulation using NVIDIA OptiX — Opticks 0.0.1 documentation). I have successfully implemented the ray tracing simulation for a few geometries and verified the results. However, the performance in terms of GPU usage, both computation and memory allocation, is not as high as I expected (the GPU usage is less than 5% on my Titan RTX card).
I’m looking for general recommendations on how to improve the computational efficiency. How would you approach this problem to get higher performance?

I’m bound to using OptiX 6.5, and the geometries that I’m targeting are fairly simple.

Any suggestion is highly appreciated!

Thank you!
Ami

Hi @hashemi_amirreza, welcome!

How are you measuring GPU utilization? Be aware that due to the proprietary ray tracing cores, some Nvidia tools do not currently show you a complete picture of utilization; the ray tracing workload may not appear as either compute or memory usage. Often high compute and memory usage is actually a sign of low efficiency, or simply of complex shaders, so low compute & memory usage is not necessarily bad.

It may be worth focusing on the overall performance, and comparing that to the expected performance. Have you measured your rendering throughput in rays per second? How fast are you expecting it to render, and how fast does it currently render?


David.

Hi David,

I’m monitoring the performance with nvtop: basically, I run my simulation and watch the GPU usage and memory allocation. How would you recommend monitoring the GPU usage?

It appears that the performance I observe largely depends on how the geometry is defined. In my understanding, OptiX uses a set of triangles (similar to a tetrahedral mesh over the geometry) to trace light through the geometry, and it appears that the geometry definition has a large impact on performance. Please correct me if you think my understanding is wrong.

Currently I observe a 5-15x speedup over the simulation on CPU (Intel Xeon E5-2687W v3 @ 3.1 GHz), but I’m looking for a speedup on the order of hundreds.

Thanks,
Ami

The best way I know of to estimate performance expectations and to measure performance is to gather rays-per-second metrics. Do you know how many rays per second you get on the CPU?

You can estimate rays/sec roughly if you have frames per second, or better yet kernel timings, and a good idea of how many rays you cast (e.g., screen resolution multiplied by samples per pixel plus the number of secondary rays). Note that using frames per second is only a very rough approximation, and might not be very stable, and includes a lot of different overheads.
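For example, at 1920x1080 with one sample per pixel and no secondary rays, 60 frames per second corresponds to roughly 1920 × 1080 × 60 ≈ 124 million rays per second.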

To get very stable numbers, this might require instrumenting your program with a little bit of extra code. The kernel launch time needs to be measured carefully, ideally using CUDA stream events placed before and after the OptiX launch. (Less ideal, but probably adequate, is to start the timer, launch OptiX, and then synchronize before stopping the timer, meaning call cudaStreamSynchronize() or cudaDeviceSynchronize().) You will also need a count of the number of rays that you cast. This can be trivial to calculate if you are casting only primary rays, or it may require some code to count the rays if you are casting reflection, refraction, or shadow rays.
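As a rough illustration, here is a minimal host-side timing sketch, assuming the OptiX 6.5 C++ wrapper (optixu); the context, width, and height are placeholders for your own objects, and the warm-up launch keeps one-time costs (JIT compilation, acceleration structure build) out of the measurement:

    #include <optixu/optixpp_namespace.h>
    #include <cuda_runtime.h>
    #include <chrono>
    #include <cstdio>

    void timeLaunch(optix::Context context, RTsize width, RTsize height)
    {
        context->launch(0, width, height);   // warm-up launch (entry point 0)

        auto t0 = std::chrono::high_resolution_clock::now();
        context->launch(0, width, height);   // the launch being timed
        cudaDeviceSynchronize();             // make sure all GPU work has finished
        auto t1 = std::chrono::high_resolution_clock::now();

        double seconds = std::chrono::duration<double>(t1 - t0).count();
        // With primary rays only, rays per launch = width * height.
        double mrays = double(width) * double(height) / seconds * 1.0e-6;
        printf("launch: %.3f ms, %.2f Mrays/s\n", seconds * 1.0e3, mrays);
    }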

Be aware that counting rtTrace() calls can affect performance. When I do this, I normally make my simulation repeatable and compile the shaders twice, once with ray counting enabled (via an atomic counter) and once with ray counting disabled. Then I count the rays and time the performance with separate launches, using the count-enabled and count-disabled shaders respectively. This takes a little bit of effort, of course, but can give much better and more stable benchmark measurements than other methods.
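A minimal sketch of such a counter in OptiX 6.5 device code might look like the following; ray_count is a hypothetical one-element buffer that you would create and zero on the host, PerRayData stands in for your own payload struct, and COUNT_RAYS is the compile-time switch for the two shader builds:

    #include <optix.h>
    #include <optix_world.h>

    using namespace optix;

    struct PerRayData { float3 result; };        // placeholder payload

    rtBuffer<unsigned long long, 1> ray_count;   // one-element buffer, zeroed on the host each launch

    static __device__ __forceinline__
    void traceCounted(rtObject top, Ray ray, PerRayData& prd)
    {
    #ifdef COUNT_RAYS
        atomicAdd(&ray_count[0], 1ull);          // tally every trace call
    #endif
        rtTrace(top, ray, prd);                  // call this wrapper wherever you would call rtTrace
    }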

I also recommend locking your GPU clocks while you measure performance. Otherwise you might get thermal throttling, which means the clock speed will drop and make timings difficult to reproduce. I usually do this in a script that calls nvidia-smi; see the -lgc and -rgc options.
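For example (the clock value here is illustrative, and locking clocks requires administrator privileges):

    nvidia-smi -lgc 1350,1350   # lock the GPU core clock to 1350 MHz
    # ... run your benchmarks ...
    nvidia-smi -rgc             # restore the default clock behavior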

It is best to leverage the RTX hardware, so yes that does mean using the built-in triangle primitive, rather than using any custom software intersectors. With a Titan RTX GPU, if you are getting less than 100 million rays per second with simple geometry & simple shaders, then something is probably very wrong. If you are getting more than that but less than 1 billion rays per second, that might indicate plenty of room for optimization. If you are getting more than 3-5 billion rays per second then you might be achieving very high utilization already.
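For reference, here is a hedged host-side sketch of what using the built-in triangle primitive looks like with the OptiX 6.5 C++ wrapper; makeTriangleGroup is a hypothetical helper, and the material, vertex data, and counts are placeholders you would supply yourself:

    #include <optixu/optixpp_namespace.h>

    optix::GeometryGroup makeTriangleGroup(optix::Context context, optix::Material material,
                                           unsigned numVertices, unsigned numTriangles)
    {
        optix::Buffer vertices = context->createBuffer(RT_BUFFER_INPUT, RT_FORMAT_FLOAT3, numVertices);
        optix::Buffer indices  = context->createBuffer(RT_BUFFER_INPUT, RT_FORMAT_UNSIGNED_INT3, numTriangles);
        // ... map() the buffers and copy your vertex/index data into them ...

        // Built-in triangles: intersection runs on the RT cores, no custom intersector.
        optix::GeometryTriangles tris = context->createGeometryTriangles();
        tris->setPrimitiveCount(numTriangles);
        tris->setVertices(numVertices, vertices, RT_FORMAT_FLOAT3);
        tris->setTriangleIndices(indices, RT_FORMAT_UNSIGNED_INT3);

        optix::GeometryInstance gi = context->createGeometryInstance();
        gi->setGeometryTriangles(tris);
        gi->setMaterialCount(1);
        gi->setMaterial(0, material);            // your material with its hit programs

        optix::GeometryGroup group = context->createGeometryGroup();
        group->addChild(gi);
        group->setAcceleration(context->createAcceleration("Trbvh"));
        return group;
    }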


David.

PS Note that I’m talking about very simple scenes and shading above. With complex scenes and complex shading, numbers below 100 million rays per second are fairly common and would not indicate a problem. There are many things that can reduce performance, such as using any-hit shaders indiscriminately, casting shadows & reflection rays, accessing a lot of memory in your closest hit shader, etc…


David.

OK great, thank you very much for your recommendations. I don’t know how many rays per second I get, but it’s a great suggestion to check.
Thanks,
Ami

Once you have those numbers, feel free to post again and we can discuss how to potentially increase them. We can even discuss that now, I think we just want to mostly ignore the nvtop datapoint because it might be fairly misleading.


David.

I appreciate that, I will certainly do that and get back to you.

I just have a separate question, because I’m trying to really understand how OptiX works. My understanding, based on chapter 4 of the OptiX manual (https://raytracing-docs.nvidia.com/optix6/guide_6_5/index.html#programs#optix-program-objects), is this: OptiX takes the geometry and creates a tetrahedral mesh (noted in the documents as a set of triangles) over the identified geometry. The purpose of the mesh is to create a set of triangles and nodes to which the material properties and background can then be attributed. The ray tracing algorithm rasters through the discretized tetrahedral mesh, changing the attribution at each mesh point (e.g. absorbed, reflected, etc.); depending upon the material properties at each mesh point the ray tracing is simulated, and this process repeats over many cycles for all rays. Therefore, the parallelization is always bounded by how the mesh is defined over the geometry.

I think the bulk of OptiX’s parallelization work is done on dividing the rays into computational threads/blocks, as the trace over the mesh is a sequential process that would not be easily parallelized. Is my understanding correct? If it is, how much parallelization/efficiency does OptiX achieve for the trace of one ray through the material? Do we have data showing how much efficiency we get just simulating a single ray?

I know this is a side question, but I would appreciate any insight.

The main parallelization in OptiX is achieved by rendering each pixel (or sample) separately for each thread. How you divide the work is up to you. You can use pixel or thread indexing to put larger or smaller amounts of work in each thread. Because the RTX GPUs have thousands of thread cores, you can render thousands of pixels concurrently. Generally speaking, by default there is little parallelization of a single ray.
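To make that concrete, here is a hedged sketch of an OptiX 6.5 ray generation program; one thread runs per launch index (i.e. per pixel), and the camera variables (eye, U, V, W), output_buffer, and PerRayData are illustrative names rather than anything from your setup:

    #include <optix.h>
    #include <optix_world.h>

    using namespace optix;

    rtDeclareVariable(uint2, launch_index, rtLaunchIndex, );
    rtDeclareVariable(uint2, launch_dim,   rtLaunchDim, );
    rtDeclareVariable(rtObject, top_object, , );
    rtDeclareVariable(float3, eye, , );
    rtDeclareVariable(float3, U, , );
    rtDeclareVariable(float3, V, , );
    rtDeclareVariable(float3, W, , );
    rtBuffer<float4, 2> output_buffer;

    struct PerRayData { float3 result; };    // placeholder payload

    RT_PROGRAM void raygen()
    {
        // Map this thread's launch index (one per pixel) onto the image plane.
        float2 d = make_float2((float)launch_index.x / launch_dim.x,
                               (float)launch_index.y / launch_dim.y) * 2.0f - 1.0f;
        float3 dir = normalize(d.x * U + d.y * V + W);

        PerRayData prd;
        prd.result = make_float3(0.0f);
        Ray ray = make_Ray(eye, dir, 0, 1.0e-4f, RT_DEFAULT_MAX);  // ray type 0; tmin/tmax illustrative
        rtTrace(top_object, ray, prd);       // this one ray is traced by this one thread

        output_buffer[launch_index] = make_float4(prd.result, 1.0f);
    }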

A few minor clarifications-

The render time parallelization has almost nothing to do with your geometry; it is bounded only by your GPU’s cores.

OptiX uses Acceleration Structures to speed up the process of intersecting a ray with a mesh and/or scene. This kind of acceleration is not parallel; it’s a sequential, single-threaded search of the data structure.

OptiX does not use tetrahedral meshes, the meshes are just surface triangle meshes. The mesh structure is really built completely by you, and you just pass it to OptiX. What OptiX does with the mesh is build the acceleration structure, and then help you test for ray intersections with the mesh by using the acceleration structure.

Ray tracing doesn’t rasterize anything, and is different from rasterization. The core of the algorithm tests a ray against a scene and produces a hit or a miss; the hit comes with information identifying which triangle or other primitive was hit, and where on the surface the hit occurred. OptiX helps you invoke a callback for this hit or miss so you can determine what color or other information the ray test found. We like to refer to these callbacks as “programs” or “shaders”.
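As a hedged illustration of those callbacks in the OptiX 6.5 device API (PerRayData is again a placeholder payload struct):

    #include <optix.h>
    #include <optix_world.h>

    using namespace optix;

    struct PerRayData { float3 result; };    // placeholder payload

    rtDeclareVariable(PerRayData, prd, rtPayload, );
    rtDeclareVariable(float, t_hit, rtIntersectionDistance, );

    RT_PROGRAM void closest_hit()
    {
        // Invoked once for the nearest hit along the ray; t_hit says where it happened.
        prd.result = make_float3(t_hit);
    }

    RT_PROGRAM void miss()
    {
        prd.result = make_float3(0.0f);      // the ray left the scene without hitting anything
    }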

Does that help? You’ll find more explanatory information in the OptiX Programming Guide, as well as in the pinned threads on the forum. The blog post “How to Get Started with OptiX 7” has a nice introduction, and you can find many past videos from GTC and SIGGRAPH in the NVIDIA On-Demand video library.


David.

I should clarify that RTX hardware does run compute work (like shaders) in parallel with ray tracing work (like ray intersection tests against meshes), so at the single-ray level you are getting speedups that are unavailable in a software ray tracer.

Still, the primary parallelization factor on the GPU comes from the ability to do many many rays in parallel, not from intra-ray work. This is not specific to OptiX, it’s the basic way that GPUs and (for example) CUDA programs operate. The speed comes from executing many many threads simultaneously, and from the SIMD architecture that allows a single instruction to operate across multiple threads at a time.


David.

Great, thank you very much, it is clear to me what is happening now: we just deal with the surface mesh.

– I thought that we would need a tetrahedral mesh in case we have an object material with light diffusion properties.

Ami

You can use a tetrahedral mesh if you need one for your simulation - there’s no reason you can’t. But that’s not something OptiX does specifically. OptiX doesn’t make assumptions about what kind of mesh you give it, about what kind of data is transported along a ray, or about what kind of tracing or rendering algorithm you’re using. OptiX tries to be agnostic and support a wide variety of ray tracing scenarios, whether it’s for scientific simulation or pretty pictures. If you’re referring to Subsurface Scattering, that is something I think is typically done purely with surface meshes in film (entertainment) renderers. But a scientific simulation of subsurface particles might use a tetrahedral mesh, and OptiX will happily build an acceleration structure for a tetrahedral mesh to help you with ray queries, if you wish.

I also forgot to mention that building the mesh acceleration structure is done in parallel using the GPU’s many thread cores. This is done separately before rendering. Maybe you already know that, but just in case it’s helpful - there are multiple phases of parallelization and some of them don’t involve rays.


David.

Sounds good, very helpful. I appreciate your help!