RTX triangles performance, any tips?

I worked through the performance section of the documentation and saw a few threads here, but I just wanted to confirm whether these numbers are off and whether there is room for optimization.

I am starting off with a grid of 500x500 positions shooting rays straight down. Depending on what they hit, in 70% of the cases the ray just dies. In the remaining 30% I spawn another 50k rays in a cone. I then report back the lowest y position among those 50k.

The scene has roughly 400k vertices and 130k triangles, 3 different materials, and the structure is as flat as possible. (However, the ~250 input meshes are each their own geometryInstance. I tried merging them all into one per material, but the effect was minimal.)

The payload data is also minimal, just a struct with one float4.

With RTX on and using the triangles functions this takes between 6 and 10 seconds.

All in all this results in ~170 million rays / 6 s = ~28 million rays per second on an RTX 2080, which seems low to me.

I read in another thread that I should try to get the number of launch indices up, so I am working on that.

Any other ideas what I could try? (I also made sure ray types are properly set, max depth is set, there are no errors, etc.)

Hi @tjaenichen,

I agree this sounds very suspiciously slow. I bet we can speed it up.

I’m not sure I understand your setup; can you elaborate? Specifically, you said you’re shooting 250k rays in your raygen shader and 70% of them hit nothing and call your miss shader. Then what does “I spawn another 50k rays in a cone” mean? Are you spawning 50k secondary rays for every single hit point? And do you trace all 50k rays in the same thread as the primary ray’s hit?

I also want to understand the numbers… 30% of 250k rays would be 75k hit points. If you spawn 50k rays for each hit point, you should end up with something like 3.75 billion secondary rays. Why did you end up with the much smaller number of 170 million? Did I misunderstand what you’re doing?

Right from the start, your setup sounds very similar to another topic here https://devtalk.nvidia.com/default/topic/1051879/optix/comparing-optix-performance-to-cuda/

If you have some threads that only send 1 ray, and some threads that send 50k+1 rays, the huge discrepancy is going to cost a lot of time. I’d recommend reading through that topic carefully to understand how to reduce your secondary ray workload down to 1 or a small number of rays cast per thread. What this most likely means is doing two separate launches rather than a single launch, one to send primary rays and capture hit points, and another to send your batches of secondary rays. More details on how to structure that approach are in that topic.
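To make the restructuring concrete, here is a minimal sketch of that two-launch approach, written in plain Python standing in for the raygen programs. All names (`trace`, `trace_cone`, etc.) are made up for illustration; a real implementation would be two OptiX launches with these loops replaced by launch indices.

```python
# Sketch of splitting one divergent launch into two balanced ones.
# Launch 1: one thread per primary ray; record hit points only.
# Launch 2: one thread per secondary ray, flattened over (hit, cone sample).

def launch1_primary(num_primary, trace):
    """Each 'thread' traces one primary ray and records a hit point, if any."""
    hit_points = []
    for i in range(num_primary):          # stands in for the launch index
        hit = trace(i)                    # trace() stands in for rtTrace
        if hit is not None:
            hit_points.append(hit)
    return hit_points

def launch2_secondary(hit_points, rays_per_hit, trace_cone):
    """One 'thread' per (hit point, cone sample) pair. Every thread casts
    exactly one ray, so the workload is uniform across the launch."""
    min_y = [float("inf")] * len(hit_points)
    for tid in range(len(hit_points) * rays_per_hit):  # flattened launch index
        h = tid // rays_per_hit           # which hit point
        s = tid % rays_per_hit            # which cone sample
        y = trace_cone(hit_points[h], s)
        if y < min_y[h]:
            min_y[h] = y                  # a real kernel would use an atomic min
    return min_y
```

The point is that launch 2 has one fixed-size unit of work per thread, so no thread in a warp waits on a neighbor casting 50k rays.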


Hi David,

Thanks a lot for having a look at this. After working on this for a while, I totally missed the obvious. You’re absolutely right, it’s more in the range of 3+ billion rays.

I had already seen a thread about not casting too many secondary rays per thread, but I will give the linked thread a good read.

But by the looks of it, I won’t get the “order of magnitude” increase I was hoping for, so I guess I need to keep the users entertained otherwise while this runs :)


Whew, good to know OptiX wasn’t going that slow. So then if my arithmetic is anywhere close to what you have, your total number of primary+secondary rays is about 4 billion, and it’s taking 6 seconds, so perf might be around 0.66 gigarays per second? That’s still a bit slow for RTX hardware and trivial shading. In the topic I linked to, @afatourechi was able to realize a 3x speedup by normalizing the workload to one ray per thread, I would guess you might see a similar result by restructuring your launch.
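Using the exact numbers from earlier in the thread, the back-of-the-envelope arithmetic works out like this:

```python
primary = 500 * 500                      # 250k primary rays
hit_points = int(primary * 0.30)         # ~75k rays hit something
secondary = hit_points * 50_000          # 50k cone rays per hit point
total = primary + secondary              # ~3.75 billion rays in total
gigarays_per_s = total / 6 / 1e9         # launch takes ~6 seconds
print(round(gigarays_per_s, 2))          # ~0.63 Grays/s
```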

You mentioned you have different meshes and materials, which is probably fine, and it sounds like merging them didn’t help. Is your scene a 2-level hierarchy, meaning each mesh/instance is in its own acceleration structure, with a single acceleration structure of instances at the root node?

Another question I had is whether your secondary rays need the closest hit position, or whether they are more like shadow rays? You mentioned reporting the minimum Y value, so I assume the closest hit does matter. If it doesn’t, though, you might want to use the ray flags or geometry flags to disable anyhit shaders. Setting these flags helps even if you don’t have an anyhit shader attached.


To be perfectly honest, I checked my code and I did use a regular int to calculate the number of rays. I guess you know where this is going :) There was also some addition in there, and it somehow ended up at a “not too unreasonable” value instead of minus something.
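For anyone else landing here: this is the classic signed 32-bit overflow. Python ints don’t wrap, so the snippet below emulates what a C/CUDA `int` does with the thread’s actual ray count:

```python
def to_int32(x):
    """Emulate signed 32-bit wraparound, as a C/CUDA int would do."""
    return (x + 2**31) % 2**32 - 2**31

secondary = 75_000 * 50_000          # 3.75 billion, the true count
print(to_int32(secondary))           # wraps to -544967296
```

Adding a couple of other terms on top of that negative number can easily land on a plausible-looking positive value, which is exactly the “not too unreasonable” 170 million.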

I also just had a pretty good idea on how to eliminate more secondary rays by first casting a smaller number and working from that before I cast the full 50k.
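That coarse-then-fine idea might look something like this sketch (all names are made up for illustration; `promising` stands in for whatever criterion fits the scene, and `trace_cone` for casting one cone sample):

```python
def cone_min_y(hit_point, n, trace_cone, coarse=64, promising=lambda y: y < 0.0):
    """Cast a small pilot batch first; only pay for the full batch of n rays
    when the pilot suggests the cone actually contains useful hits."""
    pilot = [trace_cone(hit_point, s) for s in range(coarse)]
    best = min(pilot)
    if not any(promising(y) for y in pilot):
        return best                       # cheap early out: skip the other n - coarse rays
    rest = [trace_cone(hit_point, s) for s in range(coarse, n)]
    return min(best, min(rest))
```

With 50k rays per cone, even a 64-ray pilot that prunes most cones cuts the total ray count dramatically.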

As for the hierarchy, yes, it’s a very basic one: one top group, one group for each of my 3 different types of meshes, one geometryGroup for each, and then all the instances for each object. I could remove one level, but I didn’t think it would matter. (Also, now that I know the actual ray count, I might revisit the mesh combining.)

As for any vs. closest hit: they do need the closest hit. Going through the documentation I saw the option to disable any_hit but didn’t think it would matter enough performance-wise. I will give this a shot too! (Right now the anyhit programs just aren’t there; I don’t declare them.)

Again, thanks a lot!

I also just had a pretty good idea on how to eliminate more secondary rays by first casting a smaller number and working from that before I cast the full 50k.

Reducing the ray count will definitely help if you’re casting more than you need. Just remember that because threads run in groups (warps) of 32, if one thread casts more rays than another thread in the same warp, you pay the cost of the maximum number of rays for both threads (or, more generally, for all the threads in that warp). This is why reorganizing to cast only a few rays per thread can make such a dramatic difference in run time. When the rays per thread vary a lot from thread to thread, the GPU warp spends a lot of time working on the last single active thread, when it could be doing many rays in parallel.
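The warp effect can be made concrete with a toy cost model, assuming for illustration that run time is simply proportional to rays traced:

```python
WARP = 32

def warp_time(rays_per_thread):
    """A warp runs until its slowest thread finishes, so each group of 32
    threads costs the maximum ray count within the group."""
    warps = [rays_per_thread[i:i + WARP]
             for i in range(0, len(rays_per_thread), WARP)]
    return sum(max(w) for w in warps)

# 32 threads: one casts 50_001 rays, the rest cast 1 each.
skewed = [50_001] + [1] * 31
print(warp_time(skewed))                 # 50_001: the whole warp waits

# The same total work spread as 1 ray per thread.
balanced = [1] * sum(skewed)
print(warp_time(balanced))               # ~1/32 of the skewed cost
```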

I could remove that one level, but I didn’t think it would matter.

You’re right, it doesn’t matter, you get 2 level scene traversal in RT Core hardware. Flattening it down to 1 level won’t help.


I couldn’t help myself, and late last night I implemented a quick and dirty version of the ray reduction. The impact was massive.

Also, thanks for the explanation of GPU warps. I saw the comments on threads and primary rays, but with this explanation it makes a lot more sense.

Thankfully my secondary rays are clustered, meaning that if one primary ray is invalid, chances are high that its neighbors are as well. If warps are formed from neighboring launch indices, this might not be a big enough problem to warrant a second launch. (From reading the forums before my post, I figured a second launch might be the best way to reduce rays and started working on that. However, it’s a bigger refactoring, so for now I’ll see if the business side is happy.)

Also, do you have any recommended reading on caveats when we bring this into the cloud and onto pro-grade hardware? (Tesla, I think it is.)

Yes!! So glad I could help.

For cloud rendering, we don’t have anything official to offer yet. There has been some discussion here in the forum recently about using AWS and Docker and things like that, just in case you haven’t seen those threads; other forum users have been more helpful there than I can be. The main hurdles seem to be getting builds and drivers working on cloud machines and/or in virtualized environments. AWS, for example, doesn’t offer very recent drivers or current-generation hardware on their GPU instances.

It might help to start a new thread about cloud questions, especially if you can elaborate on your specific environment & requirements. There might be people who can help with cloud questions but skip over this thread due to the title…


Thanks a lot for the insights. I’ll proceed carefully then with cloud based solutions and have a read in the forums first.

I am currently stuck on a new issue, though; I didn’t want to open a new thread just for it.

Everything runs fine on my machine (using Unity and an RTX 2080), but it fails on site at my client’s. The two machines that can’t run it both have a 980 Ti. I am using the geometryTriangles strategy, so RTX is a must, but my understanding is that this should still work (slower) on a 980 Ti.

I tried different configurations and even a small sample scene that is basically empty. This also works on my machine, but fails at my clients.

The error messages I am getting are, for the production solution:

OptixException: Optix context error : Unknown error (Details: Function “_rtContextLaunch2D” caught exception: Encountered a CUDA error: cudaDriver().CuEventSynchronize( m_event ) returned (719): Launch failed)

and for the test scene

OptixException: Optix context error : Unknown error (Details: Function “_rtContextLaunch1D” caught exception: Encountered a rtcore error: m_exports->rtcDeviceContextCreateForCUDA( context, properties, devctx ) returned (2): Invalid device context)

I know that they remote into at least one of the machines, so I don’t know if TeamViewer has anything to do with it. Querying the compute capability of the one device returns 5 and 2 (i.e., compute capability 5.2, which is correct for a 980 Ti).