Closest approach of ray to a point

I am writing a forward ray-tracing code in which “beams” are assumed to have certain radii of influence around their central rays. When it comes to summing the contributions of large numbers of beams to large numbers of point receivers, I would like to somehow represent each receiver as a sphere primitive (i.e., actually include them in the scene, rather than use a for-loop to calculate the possible influence of every ray on every receiver point). The problem is that I don’t know the radius of each beam ahead of time, so I am stuck with using maybe a few nested sphere primitives around each receiver to catch contributions from beams of different radii.

Hopefully that makes some kind of sense! Now - I was reading Ingo’s article which talks about keeping bounding volume tests in hardware and I was wondering if there might be a way to effectively set the sphere primitive bounding box size per ray in my intersection code by using some kind of instancing approach. On the face of it, this sounds impossible, because many different beams with different radii will pass the same receiver, but I thought I’d run it past experts with a better understanding of RTX etc to see if there might be any clever tricks to help me achieve this.

The problem is that I don’t know the radius of each beam ahead of time.

At what time inside that algorithm do you know the radius of each beam then?

When it comes to summing the contributions of large numbers of beams to large numbers of point receivers

Could you give some absolute numbers for these two large numbers?

Where do the beams come from?

Is there any influence of the receivers on the beams?
Means do beams simply go through everything and contribute to all receivers on their ways which are inside the beam’s radius or is there any absorption or scattering effects affecting the beam?

Is there any other geometry inside the scene which would need to be intersected?
(How would you do that with beams of different radii?)

Because if not, there wouldn’t be a need to solve this with ray tracing at all.

Calculating the distance of a line to point could simply be done inside a native CUDA kernel.

  • You have a number of receivers (points) and a number of rays (lines),
  • calculate the minimum distance between each point and each line.
  • if the distance is below a beam’s radius, add the beam’s contribution to the receiver,
  • repeat until all beams are handled.

If the number of receivers is big enough to saturate the GPU, this would be a super simple CUDA kernel which would just need to be launched as often as there are rays (gather algorithm, no atomics needed).
If one beam at a time is not saturating the kernel, chunk the work into bigger partitions, handle more beams per kernel launch.

It could also be implemented as a scatter algorithm over the beams where contributions are added to receivers with atomics if that saturates the GPU better. Might be slower though.

Now if you really need to use ray tracing for this, to be able to hit something, the BVH traversal algorithm needs to find AABBs which intersect with the ray. That traversal part is fully hardware accelerated on RTX boards when using single-level (GAS) or two-level (IAS → GAS) acceleration structure scene graphs in OptiX.

You cannot change the AABB sizes during traversal. Similar idea rejected here:

The simplest approach would be to implement the receivers as custom sphere primitives. Make them as big as as the biggest radius of all beams to be able to hit the AABB with any beam radius. Inside the custom intersection program, calculate the actual distance from the beam center ray, if it’s below the beam’s radius (needs to be stored inside the ray payload which is accessible inside the intersection program in OptiX 7) then add its contribution to the receiver.
Because this is a scatter algorithm (many beams can hit the same receiver) adding the contribution must happen with atomics.

This approach has the drawback that the receivers’ AABB have the maximum necessary radius and might potentially result in a lot of unnecessary custom intersection program invocations which are also interrupting the hardware BVH traversal.

It’s the simplest approach though and I would recommend implementing this first to see if this is good enough.
I would expect this to be slower than the native CUDA kernel method if that is applicable.

Other idea:
Implement the receivers as built-in triangle geometry.
It needs to be some convex hull containing the required receiver sphere.
For example, use a simple box made of 12 triangles. (Example code:
You could of course also use a geometry which is more spherical, like an icosahedron, to approximate the sphere better.
Intersecting triangles is again fully hardware accelerated on RTX boards.

Now there would be two different approaches.

  1. Using closest hit and continuation rays (iterative path tracer):
    Implement a closest hit program which get the geometry intersection.
    You only want one intersection per geometry, which could be done with face culling.
    Construct the box geometry to have all front faces on the outside.
    Enable face culling for backfaces on all boxes.
    When hitting the box, calculate the actual distance of the ray to the receiver center inside the closest hit program and when below the radius, add the contribution to the receiver with atomics.
    Return to the ray generation program and if the ray had not missed, shoot a new ray with the same direction and changed origin or t_min values to skip the previously hit box. Repeat until there is no more hit.

  2. Using an anyhit program:
    You could also use an anyhit program and do the calculation of the receiver to ray distance there, add the contribution to the receiver, again using atomics, and then ignore the intersection to continue the BVH traversal.
    Anyhit program invocations are not in ray direction order but depend on the BVH traversal order.
    You also need to specify that geometry can only invoke an anyhit program only once with OPTIX_GEOMETRY_FLAG_REQUIRE_SINGLE_ANYHIT_CALL.
    This method assumes no ordering is required and receivers do not affect the beams.
    It will end when there are no more intersections (which would reach the miss program which isn’t required here.

Additional idea:
Nest up to 8 of these AABB boxes for each receiver to handle different radii and assign different instance.visibilityMask bits to the instances holding each level.
Use the optixTrace visibilityMask to partition the intersected geometry by the beam’s radius by setting only one of the 8 bits to select the smallest required box level and let the hardware intersect only that,

I have no idea if that is actually faster than a single AABB with max radius. Nested AABB inside the traversal are usually bad.

This is all draft thinking and would need to be tried to see what works reasonably well.

Thank you very much Detlef for your detailed answer to my question. The problem I’m talking about is actually one in ocean acoustics which has other complications, most of which slow down performance by orders of magnitude relative to conventional ray tracing. Worst of all, my “beams” don’t follow straight ray paths but are broken up into continuously refracting curves by the sound speed profile. The length of each straight segment is set by an adaptive Runge Kutta integrator, so I’m trying to get away from the need to check receiver influences for every segment, for every ray and every receiver.

The beams in question are Gaussian beams which can expand and contract in radius depending on the sound speed gradient, which I sample from a layered texture. At the completion of each small segment, I will know the width(s) of each beam. As the name implies, the influence of each beam dies off as a Gaussian, so we don’t sum receiver influence beyond a certain number of beam radii.

So - to answer some of your questions: If I am interested in a dense sampling of 3-D transmission loss, then I may have a million receivers or so. The number of rays will be in the tens of thousands originating from a point source. Other geometry in the scene consists of the triangulated seafloor mesh and the sea surface mesh, as well as other objects in the water column.

I think your last idea of using nested AABB boxes and instance visibility masks is brilliant and we will look into implementing it. For small numbers of receivers, we have typically run a for-loop in each kernel as you said, but this idea may be what we need to handle large numbers of receivers. I’m curious though about your last comment regarding nested AABB inside the traversal usually being bad. Could you elaborate a bit?

Best Regards,

Now that’s a use case I haven’t seen, yet.
The nearest to that has been the simulation of the Icecube Neutrino Detector.

The BVH traversal performance gets better the less AABB overlap is in a scene.
For example, sorting the primitives of a city scene by material (glass, bricks, roof tiles, tarmac, etc.) would be the usual approach for a rasterizer to avoid material shader switching. But if you put all primitives sorted by material into separate GAS, then basically each of the GAS AABBs span the whole city and fully overlap each other, so the BVH traversal (starting at an IAS over these GAS) has to visit all of them and drill down to the primitives.
A more efficient BVH structure would be to spatially partition the city into less overlapping regions instead, e.g. by street blocks, then by houses. Then the instance level could already reject many primitives per ray.

With the concentric AABBs for the different receiver radii these 8 instances would almost completely overlap. But if they are filtered by the visibility mask, that’s probably not a problem with hardware BVH traversal on RTX boards. (Never tried myself.)

Similarly the performance of the simpler idea of using custom sphere primitives with AABBs sized to the biggest radius would depend on the resulting overlap of the receivers when doing that.
Means even if there is a million receivers, if their AABBs don’t overlap that would be perfectly fine.
Shooting 100,000 rays and intersecting with a million primitives shouldn’t be a problem if there isn’t too much overlap (without knowing potential runtime requirements). The memory bandwidth is usually the bottleneck on RTX boards.
The spatial acceleration structure probably beats the brute force native CUDA kernel idea then.

If BVH traversal performance is an issue (profile your OptiX device code with Nsight Compute), then it might make sense to partition the receivers into separate GAS as well. Like if they are in some regular 3D grid, simply partition them to equal sized 3D sub-region blocks.
If you’re using custom primitives, you need a top-level instance acceleration structure anyway because you cannot mix built-in triangles for the sea surface mesh or other mesh objects with custom primitives. These must be separate GAS.

Let us know how it goes. Our HPC Visualization group would surely be interested in your results.