Ray intersection without a closed-form solution for a scientific application

Hey all,

Lurking on this forum has been a wealth of knowledge while learning about OptiX, but it’s time for me to ask my own questions! I have a few, so let me try to break down my application:

I need to intersect large numbers of rays (100,000 to 10,000,000) with complex surfaces to very high precision. My scene contains very few surfaces (<100). The surfaces are defined by complex splines, so there is no closed-form intersection algorithm. In my current CPU version I iteratively solve for a zero crossing between my ray parametrization (ray distance T) and my parametric surface.

This is not a real-time application, but I’m hoping to see better throughput in GPU than CPU. I’d like to see if I can leverage OptiX, but I’m not quite sure what the right formulation is. Is there prior art for solving an intersection like this on a per-ray basis? I’m worried about significant performance degradation due to iterating within the intersection program.

Another option I’ve considered is to transform my surface into an approximate triangular mesh, intersect with the mesh, and then refine the intersection on the surface either in CPU or GPU. This would hopefully allow good convergence with a small, fixed number of zero-search iterations. I’ve yet to quantify the performance delta here.

Question 1: Any advice on how to approach formulating this (admittedly nonstandard) problem? Or am I barking up the wrong tree with OptiX?

Question 2: I’ve demonstrated that single-precision float is insufficient for the final output for this application. Prior posts have indicated that while double precision math is possible in CUDA programs for intersection, etc, the ray formulation is always single precision. Can I circumvent this by adding a double precision direction and origin to the arbitrary Ray data, and updating those separately from the single-precision value (perhaps in a refinement stage in the mesh approximation I proposed above)?

Guess it’s worth mentioning, we’re also happy to consider a CUDA implementation outside of OptiX if that’s your guidance.

Hi there, welcome!

Solving the parametric equations for ray-surface intersection with splines is pretty common, I think you’re in the right place. Lots of people use long, complicated, fancy intersection programs. :)

You’re right to worry about the performance of an iterative solver, but the issue isn’t iteration itself so much as the fact that each thread tends to need a different number of iterations. Because many rays are traced in parallel, some threads finish in just a few iterations while others keep working for many more. If you can keep the iteration count roughly the same for all threads in a warp (a group of 32 threads), you will see bigger speedups; but if the probability of a long tail for a single thread is greater than 1 / #-active-threads-per-warp, performance could suffer.
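To make that concrete, here is roughly what an iterative intersection program could look like in OptiX 6. This is an untested sketch: surface_f, surface_df, and surface_normal stand in for whatever spline evaluation you already have, and the iteration cap and tolerance are illustrative.

#include <optix_world.h>

rtDeclareVariable(optix::Ray, ray, rtCurrentRay, );
rtDeclareVariable(float3, shading_normal, attribute shading_normal, );

// Stand-ins for your spline evaluation: the residual between the ray at
// parameter t and the surface, its derivative in t, and the surface normal.
__device__ float  surface_f(const optix::Ray& r, float t);
__device__ float  surface_df(const optix::Ray& r, float t);
__device__ float3 surface_normal(const optix::Ray& r, float t);

RT_PROGRAM void spline_intersect(int primIdx)
{
    float t = 0.5f * (ray.tmin + ray.tmax);   // crude initial guess

    const int   MAX_ITER = 16;    // fixed cap bounds the worst-case thread in a warp
    const float TOL      = 1e-5f;

    for (int i = 0; i < MAX_ITER; ++i)
    {
        const float f  = surface_f(ray, t);
        const float df = surface_df(ray, t);
        if (fabsf(f) < TOL || fabsf(df) < 1e-12f)
            break;
        t -= f / df;   // Newton step toward the zero crossing
    }

    if (t > ray.tmin && t < ray.tmax && fabsf(surface_f(ray, t)) < TOL)
    {
        if (rtPotentialIntersection(t))
        {
            shading_normal = surface_normal(ray, t);
            rtReportIntersection(0);
        }
    }
}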

You are still likely to see better throughput on the GPU than the CPU even if your program isn’t well tuned; the sheer number of math cores will generally churn through rays faster than a CPU can. From our commercial users we often hear numbers in the range of 20x to 50x performance improvement using GPU over CPU.

If a good initial guess will help you reduce the number of iterations on average, then intersecting a tessellated mesh first is very likely a good idea; you can then re-launch and search for roots numerically using those results as your starting point. In OptiX, you could do that using the GeometryTriangles part of our API, so you don’t need to write the intersection program, and if you have an RTX GPU the triangle intersection is further hardware accelerated.
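The second pass just needs a seed from the triangle hit; a closest hit program on the proxy mesh could record it into your payload, roughly like this (untested, and the struct and variable names are just placeholders):

#include <optix_world.h>

// Per-ray payload for the proxy-mesh pass (names are placeholders).
struct SeedPRD
{
    float t_seed;   // mesh hit distance, used to seed the spline root search
    int   hit;      // did we hit anything at all?
};

rtDeclareVariable(SeedPRD, prd, rtPayload, );
rtDeclareVariable(float,   t_hit, rtIntersectionDistance, );

// Closest hit on the tessellated proxy: just remember where the mesh was hit so
// a follow-up launch (or a plain CUDA kernel) can refine the root on the real surface.
RT_PROGRAM void record_seed_hit()
{
    prd.t_seed = t_hit;
    prd.hit    = 1;
}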

So the answer to Question 1 is that this might be more standard than you think; it sounds doable in OptiX based on your description.

To answer Question 2:

The short answer is you are free to ignore the OptiX rays and use your own buffers of double precision data. It will cost extra memory bandwidth, but there’s no reason you can’t do it.
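For example (untested sketch; the buffer names and refine_in_double are placeholders for your own code), a raygen program could read your own double-precision origin/direction buffers, truncate to float only for the optix::Ray used in traversal, and write a double-precision result:

#include <optix_world.h>

rtDeclareVariable(uint, launch_index, rtLaunchIndex, );
rtDeclareVariable(rtObject, top_object, , );

// Your own double-precision ray storage, separate from the float optix::Ray.
rtBuffer<double3> ray_origin_d;
rtBuffer<double3> ray_direction_d;
rtBuffer<double>  hit_t_d;          // refined double-precision result per ray

struct PerRayData { float t_hit; };  // written by your closest hit program

// Your double-precision refinement step (e.g. a few Newton iterations) goes here.
__device__ double refine_in_double(double3 o, double3 d, double t_seed);

RT_PROGRAM void raygen()
{
    const double3 o_d = ray_origin_d[launch_index];
    const double3 d_d = ray_direction_d[launch_index];

    // Traversal still wants single precision, so truncate just for the optix::Ray.
    const float3 o = make_float3((float)o_d.x, (float)o_d.y, (float)o_d.z);
    const float3 d = make_float3((float)d_d.x, (float)d_d.y, (float)d_d.z);

    PerRayData prd;
    prd.t_hit = -1.0f;

    optix::Ray ray = optix::make_Ray(o, d, /*ray type*/ 0, /*tmin*/ 0.0f, RT_DEFAULT_MAX);
    rtTrace(top_object, ray, prd);

    // Refine in double precision using the float hit as a seed.
    if (prd.t_hit > 0.0f)
        hit_t_d[launch_index] = refine_in_double(o_d, d_d, (double)prd.t_hit);
}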

Since the cost might be high, I’d be curious to at least explore the bounds of what you need and see if there are ways to not require doubles for the entire pipeline.

Your output needs to be double, but do your inputs also require that precision? Does your solver require doubles for stability? Are you using transform nodes to place your geometry, or is it pre-transformed into world space? Have you compared the accuracy of your solver’s parametric results versus the output hit point? Sometimes the solver is fine in float, and the precision loss happens when converting the answer to a world space hit position. Sometimes there are ways to improve the precision of the output.


David.

Hi David,

I’d be happy to discuss more detail about our application and our (perceived) requirements for double precision - is it possible to follow up offline on the technical details of our application? Otherwise, I can try to provide some general feedback (see below). I’m also outfitting our CPU implementation to return metadata on the distribution of iterative steps for intersection convergence. I definitely appreciate your point about long-running threads in the same warp - the triangle mesh initialization would indeed be an effort to reduce that possibility.

In short: we’re planning on recursing rays deeply between these objects (through recursive calls to rtTrace() and/or re-launches), and small errors in calculated normals and ray directions can compound significantly through reflection and refraction. There’s still work to do on our side to improve precision in these operations (we’re not FMA-optimal yet) but there is certainly a noise floor that we’re approaching after many recursions.

The other area we’re pretty sensitive to is recursively-calculated ray length, though this hasn’t been as painful as normals/directions in our experience.

If you have any advice from your experience on squeezing out a little more precision without moving to double I’m all ears.

Yes, the preferred way to discuss confidential project details would be via the optix-help mailing list, that way others on the team are more likely to chime in, and we’ll keep it internal. You are also welcome to PM me directly if you prefer.

I asked about using node transforms just to check whether your intersection calculations are happening in a local space that is near the origin, to make sure you’re not losing precision unnecessarily.

Aside from that, make sure you’re using iterative “fake” recursion rather than real recursion. In OptiX this usually means having a loop in your ray generation program rather than casting rays from your closest hit program. See the optixPathTracer sample for an example.
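The shape of it is roughly this (untested; the payload fields and ray parameters are placeholders), with the closest hit and miss programs updating the payload instead of calling rtTrace themselves:

#include <optix_world.h>

rtDeclareVariable(uint, launch_index, rtLaunchIndex, );
rtDeclareVariable(rtObject, top_object, , );

// The closest hit program fills in the next segment and sets 'done' to stop the path.
struct PathPRD
{
    float3 origin;
    float3 direction;
    int    done;
};

RT_PROGRAM void raygen_iterative()
{
    PathPRD prd;
    prd.origin    = optix::make_float3(0.0f);              // illustrative source setup
    prd.direction = optix::make_float3(0.0f, 0.0f, 1.0f);
    prd.done      = 0;

    const int MAX_DEPTH = 32;   // depth cap instead of a stack-limited recursion depth
    for (int depth = 0; depth < MAX_DEPTH && !prd.done; ++depth)
    {
        optix::Ray ray = optix::make_Ray(prd.origin, prd.direction, 0, 1e-4f, RT_DEFAULT_MAX);
        rtTrace(top_object, ray, prd);   // closest hit / miss update the payload
    }
}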

For more specific advice, we’d have to dig into your setup & solver details a little. You might also just dive in and try it with full doubles and see if the performance you get is acceptable in your case. Double math is, generally speaking, twice as slow as float, and memory reads & writes will also be roughly twice as slow. One concern is that those two things can compound to produce a result that is more than twice as slow. If you’re not worried about a 2x or 3x slowdown at the moment, it might be worth prototyping to get a more accurate sense than we can provide, and then addressing how to profile and improve it if the performance isn’t as good as expected.

Just a minor note that you’ll still need the float-precision rays in order to let OptiX do its traversal. That does happen in float precision, but the bounds in the acceleration structure can tolerate 1 bit of float overreach, so you’ll get accurate traversal out of the float rays. You can additionally pad your bounds to account for double-to-float rounding error if you want, but you shouldn’t need it.
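If you do want that padding, it’s just a small epsilon in the bounding box program for your custom primitive (sketch below; the precomputed_bounds buffer and pad value are placeholders):

#include <optix_world.h>

rtBuffer<optix::Aabb> precomputed_bounds;   // one box per primitive, filled host-side

RT_PROGRAM void padded_bounds(int primIdx, float result[6])
{
    const optix::Aabb box = precomputed_bounds[primIdx];
    const float pad = 1e-4f;                // tune to your scene scale
    result[0] = box.m_min.x - pad;
    result[1] = box.m_min.y - pad;
    result[2] = box.m_min.z - pad;
    result[3] = box.m_max.x + pad;
    result[4] = box.m_max.y + pad;
    result[5] = box.m_max.z + pad;
}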


David.

Thanks David. I’m going to go and try a few permutations on this based on your feedback, and see what falls out of it. The best way to learn is doing! I’ll report back on progress and I’m sure I’ll have additional questions.

Sounds good! I should add that it was pointed out to me internally that my statement about doubles being twice as slow to compute is perhaps optimistic. That’s true for some of our higher-end workstation GPUs, but some of the game-oriented GPUs just aren’t geared for double-precision compute, so it varies between architectures and models. This doesn’t change the advice, only perhaps the expectation. Depending on which GPUs you’re using or need to support, there could end up being stronger motivation to minimize the time spent doing double math than I let on above.


David.

Hi David,

Thanks for the feedback! For now our tool is purely internal, so we can run it on hardware as we see fit. We have a GTX970, RTX2080, and P2000 to test on internally, plus the option of moving to a V100 in AWS. I hope that’s a good spread of GPUs to explore implementation tradeoffs.

I’m going to take some time to get a first implementation, then I’ll reach out here or directly with some timing data and potential bottlenecks. I’ve seen that profiling may not quite be ready yet (Optix profiling using Nsight - OptiX - NVIDIA Developer Forums) - any guidance on how to get an early read of what makes the most sense (or is the best option for now just wrapping calls in timers)?

Sorry for the delay answering your profiling question.

Good news: at Siggraph we announced that the public release of OptiX 7 is coming with profiling (and debugging!) capability in Nsight Compute, Nsight Visual Studio Edition, and cuda-gdb. BTW, these will also work with OptiX 6.

We’ll announce it here on the forum when it’s released in a week or two.

In the meantime, something generic that’s usually worth doing is adding CUDA clock() calls to the entry and exit points of your raygen program to see how long it’s taking. The OptiX 6 sample optixMotionBlur has example code that you can trigger with the ‘t’ key (‘t’ is for time-view). If you have any specific tricky problems to profile in some OptiX 6 code, let me know and I can offer more guidance.
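In code it can be as simple as something like this (a sketch; the time_buffer name and the 2D launch are assumptions):

#include <optix_world.h>

rtDeclareVariable(uint2, launch_index, rtLaunchIndex, );

rtBuffer<float, 2> time_buffer;   // one timing value per launch index

RT_PROGRAM void raygen_timed()
{
    const clock_t t0 = clock();

    // ... your normal ray generation and rtTrace work goes here ...

    const clock_t t1 = clock();
    time_buffer[launch_index] = (float)(t1 - t0);   // raw clock ticks for this thread
}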


David.