# Profiling OptiX

I have this issue:

When I double the number of rays shot and double the number of primitives, with all other parameters unchanged, the OptiX ray tracing part (not the GAS preparation) becomes 10 times slower.

I’m going to use Nsight Compute to profile OptiX, but any idea why? Is the underlying algorithm not linear?

Hi @huyleq1989,

The OptiX / RTX traversal algorithm is not linear in the number of primitives; it is sub-linear. This means that if you double the number of primitives, then under normal circumstances, the time taken will be less than double (usually much less). The complexity in the number of rays is entirely under your control, and will be linear unless you have made an algorithm that somehow depends on the number of rays.

We can’t know what’s causing the slowdown without more information, so looking at Nsight Compute might be a good starting place. Maybe your algorithm is divergent, or maybe you’re losing cache coherence, or maybe there’s something wrong with your geometry or rays. Have you tried doubling the rays and doubling the geometry separately? What are the separate slowdown factors for each?

David.


Hi David,

Thank you for your informative answer. To answer one of your questions: the doubling of the number of vertices is due to higher-resolution sampling of (more or less) the same shapes.

I found out that the slowdown is due, more or less, to the acos function, which I use to compute the rays’ angles.

So my rays’ angles have this form:

angle of ray i = theta + i * alpha, and I’m given cos(theta), sin(theta), and alpha.

So I lazily and naively used acos to compute theta from cos(theta), like this:

theta = acos(cos(theta)) if sin(theta) >= 0, and theta = -acos(cos(theta)) otherwise.

After that I compute ray i’s angle from the formula above, and then cos(angle) and sin(angle) to give to optixTrace.

This works fine in the low-resolution case but slows down (10x) in the high-resolution case when the number of primitives doubles. Then I realized that I can compute my rays’ angles without acos, by rotating v = (cos(theta), sin(theta)) by the angle i*alpha. With that change, the high-resolution case is no longer slow.

I don’t suspect anything special about the implementations of acos, sin, and cos (is there?), so the fact that my old method (using acos) works fine for the low-resolution shapes but somehow slows down at higher resolution suggests that something else is wrong with my input, i.e. cos(theta) and sin(theta).

Also, in the process of profiling, I timed the other parts too: initialization of the OptiX context, building the GAS, creating modules, creating program groups, linking pipelines, and setting up the SBT. I notice that initialization of the context, building the GAS, and creating modules are the most time-consuming. Interestingly, their timings get faster as I run my program multiple times. What’s the explanation for this? Is creating modules where/when OptiX kernels are JIT-compiled?

I don’t suspect anything special about the implementations of acos, sin, and cos (is there?)

Are you using the float variants of the trigonometric functions (acosf(), sinf(), cosf(), etc.)?
Are you doing calculations with double floating point precision?

There is a huge performance difference between float and double precision calculations on most GPUs.
This is GPU dependent and can be queried via
cuDeviceGetAttribute with CU_DEVICE_ATTRIBUTE_SINGLE_TO_DOUBLE_PRECISION_PERF_RATIO, or via
cudaGetDeviceProperties in the singleToDoublePrecisionPerfRatio field.

In addition to the double vs float performance difference, the CUDA compiler will generate a lot of code for the high-precision calculations of trigonometric functions, reciprocals, and square roots in either double or single precision.
This can bloat OptiX device code a lot and reduce the runtime performance considerably.
That is why OptiX device code is usually translated from CUDA to PTX or OptiX-IR with the CUDA compiler command line option `--use_fast_math`.
That lets the compiler generate much faster approximation instructions for the trigonometric, reciprocal and square root functions.
You’ll see that as a `.approx` suffix inside the PTX source code when comparing builds with and without `--use_fast_math`.

To summarize: avoid doubles, use the explicit float variants of functions (acosf(), sqrtf(), etc.), and compile with `--use_fast_math` for the best floating point performance.

On the topic’s main title: If you’re using Nsight Compute to profile your OptiX device code, you should be able to identify where the time is spent exactly.

Also, in the process of profiling, I timed the other parts too: initialization of the OptiX context, building the GAS, creating modules, creating program groups, linking pipelines, and setting up the SBT. I notice that initialization of the context, building the GAS, and creating modules are the most time-consuming. Interestingly, their timings get faster as I run my program multiple times. What’s the explanation for this? Is creating modules where/when OptiX kernels are JIT-compiled?

OptiX and CUDA cache device code on disk by default.
If you run the same application with the same device code multiple times, the OptiX compilation step will get a lot faster in the subsequent runs.

Find the different optixDeviceContext functions for the Cache here:
https://raytracing-docs.nvidia.com/optix7/guide/index.html#context#context

You can see the OptiX cache mechanism in action by setting a log callback with optixDeviceContextSetLogCallback and using the highest log level, 4.

Ah, sorry for the confusion. I used cosf, sinf, and acosf, so single precision.

Ah, that’s good. So what did you see in the Nsight profile?

One thing to know about trig functions that may or may not explain what you’re seeing is that they use a different pipeline than the linear math instructions like add, mul, and fma. The trig functions have a higher latency and lower throughput than linear math, which is why they have a reputation for being a little slow. But if you have a mix of trig functions and linear math, then you can sometimes do them in parallel, as long as you have enough independent computations to fill both pipes. This means that in some situations, the latency and/or throughput of trig instructions can be hidden such that they appear to be free, while in other situations, they can bottleneck and slow your code dramatically. Maybe your doubling of the rays & geometry is causing your code to flip from one state to the other. This might even be happening if you have lower divergence when you double the rays, which could magnify the effect of a bottleneck on the trig functions. Lower divergence is normally a very good thing, but sometimes it can reveal bottlenecks too.

Another possible explanation here is that you have fallen off a cache cliff with the increase in geometry and the increase in rays. If the doubling put more pressure on the cache and causes more cache misses, then your performance can drop dramatically. Take a look at the cache hit/miss rates in your Nsight profiles and see if they’re changing when you double your inputs.

It’s still worth separately doubling the rays & geometry, if you want to truly understand what’s going on. I would suggest doing 4 Nsight Compute profiles and then loading all 4 and using the baseline feature to compare your 3 new profiles (double rays, double geom, double rays & geom) against the reference profile (single rays & geom), and looking through all the metrics to see what differences stand out.

David.

For my problem, 180 rays are shot from every vertex, so the number of ray origins equals the number of primitives. It’s hard for me to double either the geometry or the rays alone.

Anyway, when both are doubled, Nsight Compute shows:

“low compute throughput and memory bandwidth utilization relative to the peak performance”
“The kernel achieved 8% of this device’s fp32 peak performance”
“Every scheduler is capable of issuing one instruction per cycle, but for this kernel each scheduler only issues an instruction every 3.4 cycles… Out of the maximum of 16 warps per scheduler, this kernel allocates an average of 1.79 active warps per scheduler, but only an average of 0.39 warps were eligible per cycle.”
“The 4.00 theoretical warps per scheduler this kernel can issue according to its occupancy are below the hardware maximum of 16.”
“On average, each warp of this kernel spends 1.9 cycles being stalled on a fixed latency execution dependency. This represents about 30.3% of the total average of 6.1 cycles between issuing two instructions.”
“This kernel’s theoretical occupancy (25.0%) is limited by the required amount of shared memory. This kernel’s theoretical occupancy (25.0%) is limited by the number of required registers. The difference between calculated theoretical (25.0%) and measured achieved occupancy (9.8%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution.”
“This kernel has uncoalesced global accesses resulting in a total of 32320389 excessive sectors (17% of the total 187681118 sectors).”

But now the ray tracing part is not the bottleneck. Building the GAS and creating modules combined are 4 times more expensive.

By the way, the log callback shows something like:

Opened database: “/var/tmp/…”
Cache data size: “4 MiB”
Cache miss for key: …
Pipeline parameter “params” size is 160 bytes

Cache hit for key: …

Is that /var/tmp/… the program cache file? And what are those cache hits and misses?

I see, okay well the next best thing is to look at your 2 profiles (single vs double) and make the ‘single’ profile your baseline, then look at the relative differences in the ‘double’ profile.

Is that /var/tmp/… the program cache file? And what are those cache hits and misses?

Yes, it’s showing you the path to the cache file, which is a database of key/value pairs. The key is a hash of your code plus all relevant configuration like OptiX and driver version (anything that should trigger a recompile if you change). The value is, of course, your compiled code. You can safely delete this file, if you like, which will force OptiX to recompile the next time you run. The “Cache miss for key” message is letting you know that OptiX had to recompile your code. If it says “Cache hit for key” that means it found your compiled code in the cache, and loaded it out of the cache instead of recompiling.
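
If you want to control where that cache file lives, or turn it off while profiling compile times, the OptiX Programming Guide documents environment variables for this; a hedged sketch (the path below is just an example, not your actual cache location):

```shell
# Relocate the OptiX disk cache (overrides the default /var/tmp location).
export OPTIX_CACHE_PATH=/tmp/my_optix_cache

# Limit the cache size in bytes; 0 disables the cache entirely,
# forcing a recompile on every run.
export OPTIX_CACHE_MAXSIZE=0

echo "cache path: $OPTIX_CACHE_PATH, max size: $OPTIX_CACHE_MAXSIZE"
```

The same things can be done programmatically per context with optixDeviceContextSetCacheEnabled and optixDeviceContextSetCacheLocation, linked in the context documentation above.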

For rendering performance, I’d recommend measuring the number of rays per second that you achieve, and optimizing your rendering time, your GAS build time, and your compile times all separately. I don’t know what it means for the GAS build and module creation to be 4 times more expensive than rendering. Module creation is considered compile time, and may only need to happen once per run of your application. Module build is subject to the OptiX cache (so it’s slow the first time after you’ve made a change that needs to be compiled, and then faster on subsequent runs as long as you don’t change anything.) GAS build is BVH creation and may or may not need to happen every frame that you render, but typically is done at a lower rate or less often than rendering, in some sense. Cost of rendering depends on how many rays / pixel samples / frames / etc. that you need, and changing these things doesn’t affect your GAS or module builds.

If you would like us to peek at your profile and do a quick analysis and discuss what we see, I’d be happy to, and I think other people here may be interested to see it as well if you’re okay with discussing those details. If you want to take me up on the offer to read your profiles but do not wish to discuss it publicly, you are welcome to PM me with your data confidentially.

David.

About the cache hits/misses: when running the same program multiple times, sometimes I saw 2 cache hits and sometimes 1 hit and 1 miss, with 2 different keys (I only have 1 OptiX program). Why is that?

“I don’t know what it means for the GAS build and module creation to be 4 times more expensive than rendering.” About this, what I meant is that I use a hybrid CPU/GPU timer to measure the time it takes to build the modules and the GAS, and it takes 4 times longer than the ray tracing part. I do see this get faster on subsequent runs.

Unfortunately, due to company rules, I won’t be able to share much of the actual performance numbers. About my program: I don’t render any image, screen, or frame. I want to compute lines of sight, where a line of sight is defined as how far an observer can see along a certain direction, like this:

I want to do this for a list of observers and multiple directions per observer.

About the cache hits/misses: when running the same program multiple times, sometimes I saw 2 cache hits and sometimes 1 hit and 1 miss, with 2 different keys (I only have 1 OptiX program). Why is that?

You’ll see cache hits when you have a key that is the same as one you’ve used before. I mentioned that the key changes whenever you modify your code to be compiled. So if you do any dynamic code injection, or if you use any `#define` macros that change, then your key will change, and you’ll get a cache miss and a recompile.

Another thing that changes your key is your compile options. So if you compile with debugging turned on, then turn debugging off, you will get a new key / cache miss / recompile. The compile options extend to all of your OptiX module and pipeline options, because these things affect what code is compiled. If you change the trace depth, toggle motion blur, use a different number of payload registers in your pipeline, change your payload semantics, or change any bound values: all of these things are considered compile options that can result in a different hash key at compile time.

So, the bottom line is that a cache hit is evidence that nothing about your code and options has changed from a previous run. A cache miss is evidence that something did change, or that you never compiled with the current configuration before; it can also happen that your compiled code was evicted from the cache file, either because the file got too large or because the configuration you’re using hasn’t been used for a long time.

Unfortunately, due to company rules, I won’t be able to share much of the actual performance numbers.

No problem, I just thought I would offer. I don’t entirely understand your exact setup even after looking at the image, but it does seem similar to some “baking” pipelines that people use: it’s common not to have a camera but instead use ray traced sampling to fill in values in some kind of texture map or vertex map over a mesh or scene, or to have some kind of sensor or receiver that isn’t in the shape of a plane. It’s not necessarily important for me to understand the exact setup, but sometimes the setup is involved in performance issues, so it’s nice to be able to either implicate or rule out the geometry as part of the slowdown. It sounds like in this case, perhaps not. With the compile cache issues and `acos` problem discussed, are there any remaining mysteries about where time is being spent?

David.

I think it’s good for now. I’ll ask more questions if something else comes up. Thank you very much, David.
