Opt out of RT hardware

I’m comparing OptiX 7.0 results on a Quadro RTX 5000 and a Quadro M4000 and I find that I don’t get quite the same results. I have to say that my code is extremely sensitive to any kind of round-off differences. So, I guess there can be a slight difference in the computation of the triangle intersection points between having RT cores and not. Is this expected? Is it possible to make the RTX 5000 perform the OptiX intersection calculations without using the RT cores?

Is it possible to make the RTX 5000 perform the OptiX intersection calculations without using the RT cores?

That is not possible. Please read these explanations of a similar question and follow the links in them:

If you’re comparing the Quadro RTX 5000 and Quadro M4000 with respect to the intersection precision of the built-in triangle primitives, they should be very similar, but you cannot expect a perfect match between different hardware.
A single difference like using a fused-multiply-add and a code path with separate multiply and add can already result in different rounding.
From a performance standpoint, your RTX board should always be faster than a board that is three generations older and from a lower product tier.
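To make the fused-multiply-add point concrete, here is a minimal host-side sketch in Python. It is not OptiX or GPU code, and it works in double precision where the GPU intersection math would be single precision. The function names and input values are made up for illustration, and the fused variant is only emulated with exact rational arithmetic followed by one rounding.

```python
from fractions import Fraction


def separate_multiply_add(a, b, c):
    # Two roundings: the product is rounded to a double first,
    # then the sum is rounded again.
    product = a * b
    return product + c


def fused_multiply_add(a, b, c):
    # Emulated FMA (assumption: stands in for a hardware FMA).
    # The expression is evaluated exactly, then rounded once to a double.
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Hypothetical inputs chosen so that the exact product 1 - 2**-60
# is not representable as a double and rounds to 1.0.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

separate = separate_multiply_add(a, b, c)  # 0.0
fused = fused_multiply_add(a, b, c)        # -2**-60
print(separate, fused)
```

Same inputs, same formula, different result, purely because of where the rounding happens. A compiler or hardware path that fuses one multiply-add and not another is enough to produce this kind of mismatch.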

You could implement your own triangle intersection program, but then you lose all performance improvements from the hardware triangle intersection and need to call back to the streaming multiprocessors during BVH traversal. That’s considerably slower.
Also note that the built-in triangle intersection routine is watertight. The triangle intersection code provided in earlier OptiX SDK API versions is not.
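For reference, this is roughly the arithmetic a software triangle test would have to do. The sketch below is a plain Möller-Trumbore ray/triangle test written in Python as a host-side reference only. It is not OptiX device code, it is not the built-in routine, and it is not watertight. The function name, the tuple-based vectors and the `eps` value are assumptions for illustration. A real custom intersection program would be CUDA device code that reports hits through `optixReportIntersection`.

```python
def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def intersect_triangle(origin, direction, v0, v1, v2, eps=1e-12):
    """Return (t, u, v) for a hit in front of the origin, or None for a miss."""
    e1 = _sub(v1, v0)
    e2 = _sub(v2, v0)
    p = _cross(direction, e2)
    det = _dot(e1, p)
    if abs(det) < eps:
        return None  # ray is (nearly) parallel to the triangle plane
    inv_det = 1.0 / det
    s = _sub(origin, v0)
    u = _dot(s, p) * inv_det
    if u < 0.0 or u > 1.0:
        return None
    q = _cross(s, e1)
    v = _dot(direction, q) * inv_det
    if v < 0.0 or u + v > 1.0:
        return None
    t = _dot(e2, q) * inv_det
    if t < 0.0:
        return None  # intersection lies behind the ray origin
    return (t, u, v)


# Usage: a ray along +z hitting the unit right triangle in the z = 0 plane.
hit = intersect_triangle((0.25, 0.25, -1.0), (0.0, 0.0, 1.0),
                         (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
print(hit)
```

Every multiply-add in the dot and cross products is a place where FMA contraction can change the last bit, so even a custom program like this would not be guaranteed to match bit for bit across architectures.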

Maybe explain why your rendering algorithm is so sensitive to the floating point accuracy.
Is your scene huge? Are nearby intersections very close to each other? Etc.

That’s what I see. Most of the time the results are identical, but a few rays will have 1 ulp differences.

The problem only arises in my regression tests where I expect very close agreement between devices and I’m too lazy to run enough rays to smear out the differences sufficiently. In normal use, the differences are not significant (certainly no worse than the usual FMA differences). I was trying to save myself some bother generating different gold standards.

Supplying a custom IS program is an interesting idea. I’ll look at that. For the most part, my regression tests are looking for correctness rather than performance.

Thanks again for the tips.

If this is just about a 1 ULP difference in your image comparison tool, then you should instead change the comparison tool to work with a threshold, or keep separate golden images per architecture to compare against. Guess what we do.
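A threshold comparison could look something like the sketch below, which measures the distance between two single-precision values in ULPs and accepts a result if every value is within a chosen tolerance. This is only an illustration, not the comparison tool used by anyone in this thread. The function names and the default tolerance of 1 ULP are assumptions, and the inputs are assumed to be values that fit in a 32-bit float.

```python
import math
import struct


def _ordered_int(x):
    # Reinterpret a float32 bit pattern as an integer that increases
    # monotonically with the float value (+0.0 and -0.0 both map to 0).
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if bits & 0x80000000:
        return -(bits & 0x7FFFFFFF)
    return bits


def ulp_distance(a, b):
    # Number of representable float32 values between a and b.
    return abs(_ordered_int(a) - _ordered_int(b))


def within_ulps(a, b, max_ulps=1):
    # NaN never matches anything, including another NaN.
    if math.isnan(a) or math.isnan(b):
        return False
    return ulp_distance(a, b) <= max_ulps


def results_match(reference, candidate, max_ulps=1):
    # Compare two equally long sequences of float32 values element-wise.
    if len(reference) != len(candidate):
        return False
    return all(within_ulps(r, c, max_ulps) for r, c in zip(reference, candidate))


# Usage: 1.0 and the next float32 above it differ by exactly 1 ULP.
next_after_one = 1.0 + 2.0**-23
print(ulp_distance(1.0, next_after_one))
print(results_match([1.0, 0.5], [next_after_one, 0.5]))
```

An ULP-based tolerance scales with the magnitude of the values, which fits hit distances and intersection points better than a fixed absolute epsilon would.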

You shouldn’t seriously consider a custom intersection program, given the expected performance loss on RTX boards. I was assuming this was about some scientific requirement.
You wouldn’t want to use that during testing if it’s not what actually runs in the end. Even then, the streaming multiprocessors are different, and code generation is not supposed to result in identical microcode across different GPU architectures either.