Intersection program called too frequently

I’ve written custom bounding box and intersection programs for a scene populated with circles (flat disks, not spheres). The scene runs very slowly, and I noticed that the intersection program is being called a huge number of times: roughly 100 times per ray on average, when I would have expected 1 or 2 times per ray at most. As a sanity check, I copied the bounding box code into the intersection program to verify that each intersection point found is indeed inside the bounding box. As expected, the intersection program makes it past that check only 1 or 2 times per ray on average.
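
For readers who want to reproduce the sanity check, here is a minimal CPU-side sketch in plain C++. It is not the actual OptiX device code from my project. Every type and function name in it (Vec3, Circle, Aabb, circle_bounds, intersect_circle, hit_passes_bounds_check) is a hypothetical stand-in, and it assumes that each circle is stored as a center, a unit normal, and a radius, and that the bounding box is the conservative center ± radius box.

```cpp
#include <cmath>

// All types and functions below are hypothetical stand-ins, not OptiX API.
struct Vec3 { double x, y, z; };

static Vec3 add(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3 scale(Vec3 a, double s) { return {a.x * s, a.y * s, a.z * s}; }
static double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

struct Circle { Vec3 center; Vec3 normal; double radius; };  // normal assumed unit length
struct Aabb { Vec3 lo, hi; };

// Conservative bounds: every point of the disk lies within `radius` of the center.
Aabb circle_bounds(const Circle& c) {
    Vec3 r = {c.radius, c.radius, c.radius};
    return {sub(c.center, r), add(c.center, r)};
}

bool inside(const Aabb& b, Vec3 p) {
    return p.x >= b.lo.x && p.x <= b.hi.x &&
           p.y >= b.lo.y && p.y <= b.hi.y &&
           p.z >= b.lo.z && p.z <= b.hi.z;
}

// Ray/plane intersection followed by a radius test.
// Reports a hit only if t lies in [tmin, tmax].
bool intersect_circle(const Circle& c, Vec3 origin, Vec3 dir,
                      double tmin, double tmax, double* t_hit) {
    double denom = dot(c.normal, dir);
    if (std::fabs(denom) < 1e-12) return false;  // ray parallel to the circle's plane
    double t = dot(c.normal, sub(c.center, origin)) / denom;
    if (t < tmin || t > tmax) return false;      // outside the ray's valid interval
    Vec3 d = sub(add(origin, scale(dir, t)), c.center);
    if (dot(d, d) > c.radius * c.radius) return false;  // in the plane, but off the disk
    *t_hit = t;
    return true;
}

// The sanity check: a reported hit point should always lie inside the bounds.
bool hit_passes_bounds_check(const Circle& c, Vec3 origin, Vec3 dir,
                             double tmin, double tmax) {
    double t = 0.0;
    if (!intersect_circle(c, origin, dir, tmin, tmax, &t)) return false;
    return inside(circle_bounds(c), add(origin, scale(dir, t)));
}
```

Counting how often hit_passes_bounds_check returns true versus how often it is called at all gives the same two numbers I compared above (1 or 2 versus roughly 100 per ray).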

Why would the intersection program be called for ray intersections not inside the bounding box?

A few notes that may or may not be relevant:

  • I'm using a Bvh acceleration structure. I also tried an Sbvh, which actually caused the intersection program to execute even more often.
  • Each ray is very short, extending from t = -1e-6 to t = 1e-6. I'm just testing to see if there is a circle in close proximity to each of a set of given points. I've verified that rtIntersectionDistance is always within this range. Edit: The intersection program is indeed called for intersections that are outside of this range.
  • The rays are not particularly coherent. In particular, I don't expect very many rays to intersect the same circle.
  • All circles have the same material.
  • The any hit program calls rtIgnoreIntersection() because I want to know about all circles that overlap each point.
  • There is a miss program that does nothing.
  • There is no closest hit program defined.

Does OptiX test all threads in one warp for intersection with each object, even when only one thread in that warp has a ray that intersects the bounding box?

I’ve come up with a partial solution for incoherent rays under that assumption. I increased the number of launched threads by a constant factor and filled the extra threads with junk input so that they do not cast any rays. Performance was best when only 1 in 4 threads performs intersection testing (i.e., 8 active threads per 32-thread warp). This change more than doubled the speed of my code.
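
To make the padding scheme concrete, here is a rough sketch of the index mapping in plain C++. The names (kStride, kNoRay, padded_launch_size, ray_for_thread, active_threads_in_warp) are hypothetical, and two things are assumptions rather than facts: that the active threads are interleaved evenly (every 4th launch index), and that 32 consecutive launch indices end up in the same warp, which OptiX does not promise.

```cpp
#include <cstddef>

// Hypothetical padding factor: 1 active thread per kStride launch indices.
constexpr std::size_t kStride = 4;
// Assumed warp size and assumed contiguous mapping of launch indices to warps.
constexpr std::size_t kWarpSize = 32;
// Sentinel meaning "this thread is padding and casts no ray".
constexpr std::size_t kNoRay = static_cast<std::size_t>(-1);

// Number of threads to launch for a given number of real rays.
constexpr std::size_t padded_launch_size(std::size_t num_rays) {
    return num_rays * kStride;
}

// Maps a launch index to the ray it should trace, or kNoRay for a padding thread.
constexpr std::size_t ray_for_thread(std::size_t launch_index) {
    return (launch_index % kStride == 0) ? launch_index / kStride : kNoRay;
}

// Counts the threads in one warp that actually trace a ray.
std::size_t active_threads_in_warp(std::size_t warp) {
    std::size_t active = 0;
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        if (ray_for_thread(warp * kWarpSize + lane) != kNoRay) ++active;
    }
    return active;
}
```

In the ray generation program, a thread whose mapping yields kNoRay would simply return without tracing anything. With kStride = 4 this leaves 8 of every 32 consecutive launch indices active, which matches the ratio that worked best for me.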