Are these reasonable numbers? RTX 3060, Optix 7, 128 billion rays in ~35 seconds

As in calls to optixTrace()

I have a feeling we’re hitting the theoretical limit here and was just wondering if it’s worth trying to find further ways to optimize. Shaving a couple of seconds off wouldn’t make much of a difference, but cutting it in half or better would. Also, we have better hardware; this is just a gauge from my laptop. Just looking for a gut feeling as to whether this seems about right.

This is ~3.6 Grays/sec. Seems pretty reasonable to me for a 3060, and I’d guess it would be hard to make it 2x faster, even if theoretically possible. It’s not bad in the sense that some of our SDK samples run slower than this on higher-end GPUs. Is it a single kernel launch, or many launches over 35 seconds? Single sample per pixel, or multiple? Path tracing, or primary rays only, or something else? How big is the scene? Single level, two level, or higher?

I’d guess the peak ray limit is higher on a 3060 (but I don’t know exactly what the limit is). Whether you can achieve higher depends on the details of your application, including the kernel size, payload size, memory bandwidth required, OptiX features used, renderer type, etc. To approach the theoretical limit on any GPU, you need big batch sizes, a small payload, coherent rays, simple shading, no any-hit programs, hardware triangles only, fast-math options enabled, and careful handling of any intermediate data and of the output buffer if you include I/O in your timings.


Hi David,

So I’m in the right ballpark, and it’s more of a UX problem now, or something where we reduce the search space we’re looking at.

I’m not actually rendering pixels; I’m casting rays from inside the rooms of whole city blocks, checking visibility through windows toward the sky and planned buildings. That’s how I end up with a launch domain of a few hundred thousand by a few hundred thousand, which I chop into pieces small enough to fit within the 2 billion launch-index restriction.
I already do almost all of the things you mentioned, and where I can’t, I’m aware of it (I’ll double-check fast-math; I haven’t revisited that and will see if we still hit our accuracy goals).

I was still concerned that I might be off by an order of magnitude or so, because that’s usually where I start when implementing these. (Usually a few orders off. :))

Thanks a lot for taking the time and confirming my intuition here, I really appreciate it!



Well, you’re definitely within an order of magnitude, and my guess is probably well within a factor of 2. If you haven’t tried it yet, I recommend running your kernel through Nsight Compute to see what it has to say about your memory and SM usage, and to take a peek at where the hotspots are. You might have to shrink your kernel size temporarily to get it profiled, but anything over maybe a million threads should be enough to saturate the GPU and give accurate profiling.


The problem is that I have to work through Unity and haven’t yet found a way to get those debug tools to work.
But I’ve also only spent limited time trying, so this might be a good opportunity to give it another stab; I’ve learned some new tricks since then as well. Good to know about the million threads needed to saturate the GPU!