How do I avoid hitting the same triangle when tracing another ray?

If you call optixTrace after hitting a triangle to trace another ray from the intersection location, there is a chance of hitting the same triangle unless you use a large epsilon (like 1e-5). However, I can’t afford to use such a large epsilon. It would be much better if I could somehow mask out the triangle that was just hit. Is this possible?

Hi @atm008,

There are a couple of options for solving this problem. One is to use a more precise epsilon value, and we provide a solution to help with that. Take a look at the pinned thread here: Solving Self-Intersection Artifacts in OptiX

You could also do what you suggest, and mask out the hit triangle. The way to achieve this is to use an any-hit program and to record the primitive ID into your ray payload when you hit something with a primary ray. Then in your any-hit program for secondary ray types, you can optionally reject hits that match the recorded primitive ID.
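The masking idea above can be sketched as a small any-hit program. This is a hedged sketch, not SDK code: `PRD`, `unpackPointer`, and the `lastHitPrimId` field are assumptions about how your payload might be laid out.

```cpp
// Any-hit program for secondary rays: ignore a hit on the primitive that the
// primary ray already recorded. PRD and unpackPointer are app-side conventions.
extern "C" __global__ void __anyhit__skip_self()
{
    const PRD* prd = reinterpret_cast<const PRD*>(
        unpackPointer( optixGetPayload_0(), optixGetPayload_1() ) );

    if( optixGetPrimitiveIndex() == prd->lastHitPrimId )
        optixIgnoreHit();  // reject this hit; traversal continues past it
}
```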

Be aware that the any-hit solution is likely to compromise your traversal performance. Use of any-hit is not advised unless absolutely necessary, because each any-hit invocation has to interrupt the hardware-accelerated traversal, which incurs high overhead. We recommend first explicitly disabling any-hit programs and trying the self-intersection avoidance code, which is good at minimizing the epsilon needed, and seeing if the results are acceptable.

If you can guarantee that your fp32-rounded ray origin is on the “outside” of the triangle and that your secondary ray’s direction points away from the triangle, then you may be able to avoid needing any epsilon at all. This is not typically easy to do, but it may be an option if you know how; the hardware-accelerated triangle intersector will not return a hit with t exactly equal to zero.



Thanks David.

I tried the any-hit approach and was able to get very good accuracy. However, I am now tracing each ray backwards along the same path as forwards, and the backwards pass is hitting more triangles than the forward direction. One potential cause would be some kind of built-in epsilon that makes the trace directional. Do you know if this is the case?

Any advice on how to handle this in general? My triangles are best described as triangle soup, which is why I ended up accepting the performance hit of the any-hit shader.


Hi Alex,

I’m not sure I understand the setup of the forwards & backwards passes, and exactly what the problem is. Could you elaborate a bit? Are you saying that you’re tracing pairs of rays with opposite directions, from randomly chosen scene points, and expecting to see approximately the same number of any-hit invocations statistically? How large is the discrepancy between forwards and backwards? How are the ray origins chosen, exactly? Do you have backface culling enabled?

There is no hidden epsilon in OptiX or anything internal that would make tracing directionally dependent, if that’s what you’re asking.



In the forward pass, I trace a ray through the scene from point a to point b. Then, in the backwards pass, I trace the ray from point b to point a.

I don’t believe I have backface culling enabled, but how would I check? Is that an option passed to OptiX?

Also, I have another question: is there a way I can release the GAS without destroying the context? The context takes a while to create, so it would be nice if I could reuse it.
I’m repeatedly creating a new GAS, and I’m wondering how to do this efficiently.


I think you would know if you had back-face culling enabled, it’s an opt-in feature. You can look for “BACK_FAC*” in optix_types.h for the various enums that you can use.
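For reference, one place the opt-in can happen is a ray flag at trace time. A hedged sketch (the SBT indices and payload registers here are app-specific placeholders):

```cpp
// Enable back-face culling for this ray via a per-ray flag; see the
// OPTIX_RAY_FLAG_CULL_BACK_FACING_TRIANGLES enum in optix_types.h.
unsigned int p0, p1;
optixTrace( handle, origin, direction,
            0.0f, 1e16f, 0.0f,                     // tmin, tmax, rayTime
            OptixVisibilityMask( 255 ),
            OPTIX_RAY_FLAG_CULL_BACK_FACING_TRIANGLES,
            0, 1, 0,                               // SBT offset/stride, miss index
            p0, p1 );
```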

So I’m not sure why you’re seeing asymmetric results between your forward & backward passes. It could be precision issues, i.e., launching from point a doesn’t land at exactly point b, and vice-versa. With the new any-hit formulation, perhaps you are recording some self-intersections with t ~= 0, or perhaps you’re not recording all the destination hits at point b. Those might be the first things to check, and we can help brainstorm other possibilities if those don’t help.

Your GAS isn’t held state in OptiX; its existence is managed entirely on the app side, so you can release it at any time (as long as a kernel using it is not mid-execution). The GAS buffer is reserved via cudaMalloc() or something similar, and your use of the GAS is controlled explicitly on your end by passing it to an Instance-AS build input, or passing the GAS handle to optixTrace(). Once you’re done doing those things, you are free to swap it out for another GAS.

In terms of efficiency for repeated launches, note that best practices would include reusing allocations for anything you’d like to rebuild. This could mean allocating a larger GAS buffer than you need the first time so that when you go to rebuild it, there’s always enough space, and you can skip over the cycle of cudaFree() followed by cudaMalloc(). Or you can get even fancier and write some kind of allocator class to manage your CUDA memory allocations and reuse them whenever possible. We like to use ring buffers for this kind of thing and recommend that users do the same.
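The reuse idea above can be sketched as a tiny host-side helper. This is a hedged illustration (error checking omitted; `GasBuffer` and the 50% headroom policy are assumptions, not SDK code):

```cpp
// Keep one oversized device buffer across GAS rebuilds, and only reallocate
// when a rebuild won't fit, skipping the cudaFree()/cudaMalloc() cycle.
struct GasBuffer
{
    CUdeviceptr ptr  = 0;
    size_t      size = 0;

    CUdeviceptr reserve( size_t needed )
    {
        if( needed > size )
        {
            if( ptr ) cudaFree( reinterpret_cast<void*>( ptr ) );
            size = needed + needed / 2;  // grow with headroom for future rebuilds
            cudaMalloc( reinterpret_cast<void**>( &ptr ), size );
        }
        return ptr;  // pass this as the outputBuffer of optixAccelBuild()
    }
};
```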



Upon further investigation, it looks like the results between the forward and backward passes are probably close enough, even without the any-hit shader. I had a bug elsewhere that was causing the sensitivity.

As for the repeated launches, I’m rebuilding the GAS for 800k triangles that are very close to each other. I’m getting some kind of operation that takes 300 ms, but I’m not sure what it is. It takes 80 ms for 80k triangles, so I suspect the GAS build, but I can’t be sure. Any idea what to check first?

Oh good, glad the epsilon issues are resolved!

The first thing I do for curious timing bugs is fire up Nsight Systems. nsys profile can capture your run on the command line and show you where all the time is going. If things are async & confusing, then set CUDA_LAUNCH_BLOCKING=1 in your environment to get all the launches to be synchronous, so you can really see which one’s taking all the time.
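A hedged example of what such a capture might look like on the command line (the output name and trace selection are illustrative, not prescriptive):

```shell
# Force synchronous launches and capture a timeline of CUDA API calls & kernels:
CUDA_LAUNCH_BLOCKING=1 nsys profile --trace=cuda,nvtx -o myreport ./my_app

# Summarize where the time went:
nsys stats myreport.nsys-rep
```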



So I profiled the code and it says that the main issues are uncoalesced global accesses and long scoreboard stalls.

I have some global memory reads but I wasn’t expecting it to be this much. Is this OptiX?

Those two things are the most common bottlenecks for a ray tracing kernel’s performance. Coalescing is difficult to achieve with ray tracing. One possible solution is SER, if the uncoalesced memory access is a real and large bottleneck. Long scoreboard stalls are synonymous with local/global memory access latency, and it doesn’t take much memory access in a kernel, when there are 50k threads in flight, for the memory system to become the limiting factor.

This screenshot is from Nsight Compute, however, not Nsight Systems. And you’re looking at the profile for a single specific kernel invocation … do you know which one it is, is it optixLaunch, or one of the optixAccelBuild kernels, or something else? There are several different “Nsight” tools. Nsight Compute is great when you want to deep-dive into a specific kernel and maybe optimize it. Nsight Systems is the tool for profiling the timings of many kernels to find out how long each one takes. If you’re still investigating which kernel or API call or host-side function is taking the unexpected 300 ms, then see what you get with Nsight Systems.



nsys profile is taking a good chunk of time to run; I’m not sure it is even executing my code. I have some Python + PyTorch code I don’t want it to profile, but I don’t know of a way to exclude that.

On the other hand, CUDA_LAUNCH_BLOCKING=1 plus timing various parts of the code was actually really helpful. I hadn’t fully appreciated how much memory the GAS was consuming, so I didn’t register your previous comment about not repeatedly reallocating the same memory. I was able to improve performance by quite a bit (3x, if not a bit more). Now the actual bottlenecks are the kernels accessing global memory.
Current numbers:
100 ms for GAS build
100 ms for forward pass
100 ms for backward pass

With 80 million tightly overlapping triangles, I’m not sure I can expect any better.

Ah, I know nsys profile likes to be launched against a binary executable; it may be that it’s not attaching to your process automatically and is just waiting around. Usually nsys profile doesn’t slow down my run, since it’s not doing any instruction sampling, so I wouldn’t expect the PyTorch stuff to get in the way other than you might see a bunch of it in your output. You can also instruct Nsight Systems to wait a certain amount of elapsed time before it starts profiling, and to profile only for a certain duration. Poking through the various options or taking a look at the GUI interface for profiling might help you set up a fast profile.

Yes! Super glad there was something useful in there, 3x is excellent speedup for a day’s work. ;) Which GPU are you using? 800M tris/sec for BVH build sounds quite reasonable and decent. The lowest hanging fruit now may be trying to optimize the traversal kernels (… with Nsight Compute).



Thanks for the help! It is indeed a great speedup. I’m using an RTX 6000. Is there any typical low-hanging fruit in the traversal kernels? Or will optimizing the geometry for better traversal give the best performance? I’m wondering if there is any way to store the global memory data in texture memory or something, but it looks like that should already be happening in the L1 cache.

It’s worth tallying your rays/sec metric for your forward & backward passes. That might be a reasonable litmus test as to whether you should go deep on optimization. There’s no correct threshold number, it depends completely on your scene, camera, shaders. If you have extremely small shader programs and you’re already getting single or double-digit billions of rays per second, then it’s pretty fast. If it’s under 1 gigaray/sec, then depending on how much compute is in your shader programs, some effort to try optimizing might be justified. If you have very heavy compute in your shaders, then tens or hundreds of megarays/second might actually be a reasonable upper limit given what you’re throwing at the GPU, however in this case you’re very likely to find significant optimization opportunities.

The most likely ways you can speed up your render passes are to reduce memory usage and/or to increase the coherence of memory accesses. I wouldn’t expect moving data to texture memory to help necessarily, and it can be a bit of a pain too. (But it’s hard to know exactly what will or won’t help without trying…)

If your raygen or hit shaders have any significant compute or complexity, check for register spilling to memory. OptiX will report spills in the compile info output.

You can use Nsight Compute to examine your compiled SASS assembly, and get line-by-line stats. It will show you which load and store instructions are causing the bottlenecks and help you associate those instructions with the corresponding line(s) of code. (Register spills are often visible in the SASS and show up as a batch of store instructions at the beginning of your function, and a corresponding batch of loads at the end).

To reduce memory usage, look at your ray payload and scrutinize every byte. If you can remove or compress anything, it may help. Keep in mind that it’s frequently, and sometimes surprisingly, faster to recompute derived values as needed than it is to cache them in memory.

If you’re using a pointer to local memory in your payload, consider using the optix{Get,Set}Payload* functions as they will try to keep the payload values in registers instead of using memory.
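A hedged before/after sketch of the register-based style (SBT indices and the slot assignments are app-specific assumptions): instead of packing a pointer to a local-memory struct into the payload, write the values directly into payload slots and read them back from the trace call.

```cpp
// Closest-hit: store results in payload registers rather than local memory.
// Slot 0: hit primitive ID; slot 1: hit/miss flag (app-defined convention).
extern "C" __global__ void __closesthit__record()
{
    optixSetPayload_0( optixGetPrimitiveIndex() );
    optixSetPayload_1( 1u );
}

// Caller side (inside raygen): the payload registers are passed by reference
// to optixTrace and hold the results when it returns.
// unsigned int primId = 0u, hit = 0u;
// optixTrace( handle, origin, direction, tmin, tmax, time,
//             OptixVisibilityMask( 255 ), OPTIX_RAY_FLAG_NONE,
//             0, 1, 0,          // SBT offset/stride, miss index
//             primId, hit );
```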

OptiX has an advanced mode for payload usage that is called ‘payload semantics’ where you can specify detailed info about which payload values are read or written in which programs. Sometimes this can allow OptiX to save registers.
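A hedged sketch of what declaring payload semantics can look like at module compile time; check optix_types.h in your SDK version for the exact flag and field names, as the details below are from memory:

```cpp
// Declare one 32-bit payload slot that the trace caller reads and the
// closest-hit program writes, so OptiX can minimize register pressure.
unsigned int semantics[1] = {
    OPTIX_PAYLOAD_SEMANTICS_TRACE_CALLER_READ | OPTIX_PAYLOAD_SEMANTICS_CH_WRITE
};

OptixPayloadType payloadType = {};
payloadType.numPayloadValues = 1;
payloadType.payloadSemantics = semantics;

OptixModuleCompileOptions moduleOptions = {};
moduleOptions.numPayloadTypes = 1;
moduleOptions.payloadTypes    = &payloadType;
// Pass moduleOptions to optixModuleCreate() as usual.
```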

If you can think of ways to cast rays or do your indexing so that your memory accesses are more likely to be close together, it may be worth experimenting with them. Otherwise, Shader Execution Reordering (or “SER”) is a way to bucket & sort ray batches based on which hit shaders will execute as well as spatial locality and any other custom bits you want to add to the sort key. SER is unlikely to help you unless you have a relatively significant divergence problem already, but if you do and you can think of ways to better sort your data, then it might be worth looking in to.
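For reference, a hedged sketch of the SER call pattern (available in newer OptiX versions; signatures simplified and `materialId` is a hypothetical app-defined sort key): the trace is split so hits can be reordered before shading runs.

```cpp
// Traverse without invoking hit/miss programs yet.
unsigned int p0, p1;
optixTraverse( handle, origin, direction, tmin, tmax, time,
               OptixVisibilityMask( 255 ), OPTIX_RAY_FLAG_NONE,
               0, 1, 0, p0, p1 );

// Reorder threads by hit state plus an app-defined coherence hint
// (here: a hypothetical material ID in the low 4 bits of the key).
optixReorder( materialId /* coherence hint */, 4 /* hint bits */ );

// Now execute the closest-hit/miss program with improved coherence.
optixInvoke( p0, p1 );
```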



This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.