Since the number of bounces a ray takes differs from point to point in the scene, there must be significant GPU branch divergence, and hence poor performance. Are there any studies on this, and what can be done about it?
There are several sources of divergence, so whether divergence becomes a problem depends on the scene and the renderer. One source is ray traversal, intersection, and any-hit shaders. Rays in a warp diverge when they traverse an acceleration structure to different depths, or when they pass through parts of the scene with very different geometric density.
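To put a rough number on the traversal case, here is a toy model in Python. It is only a sketch: it assumes a 32-lane warp running a software traversal loop in lockstep, so the whole warp stays busy until its slowest ray finishes. The function name and the step counts are made up for illustration, and hardware traversal units will not behave exactly like this.

```python
WARP_SIZE = 32  # assumption: 32 lanes per warp


def warp_utilization(steps_per_lane):
    """Fraction of lane-steps doing useful work when all lanes run in lockstep.

    The warp occupies its lanes for max(steps) iterations, but each lane only
    has useful work for its own step count.
    """
    assert len(steps_per_lane) == WARP_SIZE
    longest = max(steps_per_lane)
    if longest == 0:
        return 1.0
    return sum(steps_per_lane) / (WARP_SIZE * longest)


# Hypothetical step counts: every ray visits 20 nodes...
coherent = [20] * WARP_SIZE
# ...versus 31 short rays and one ray that wanders through dense geometry.
divergent = [5] * 31 + [100]

coherent_util = warp_utilization(coherent)    # 1.0
divergent_util = warp_utilization(divergent)  # 255 / 3200, about 0.08

print(coherent_util, divergent_util)
```

In this model a single long ray drags the warp down to about 8% useful work, which is why rays with similar depth and similar geometric density in the same warp matter so much.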
Another source of divergence is shading materials in the scene. If each ray in a warp hits a surface that uses a unique material shader, the cost will be significantly higher than if all the rays hit the same material.
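A back-of-the-envelope model of that material cost, again as a Python sketch under simplifying assumptions: the warp runs each distinct shader once with the non-matching lanes masked off, and every shader has a fixed cost. The cost table and function name are hypothetical.

```python
def warp_shading_cost(material_ids, cost_per_material):
    """Cost of shading one warp when distinct shaders execute one after another.

    Lanes that hit the same material run together; lanes that hit a different
    material sit idle until their shader's turn comes.
    """
    return sum(cost_per_material[m] for m in set(material_ids))


# Hypothetical: 32 materials, each costing 10 units to shade.
costs = {m: 10 for m in range(32)}

same_material = [0] * 32          # every lane hits material 0
unique_materials = list(range(32))  # every lane hits a different material

same_cost = warp_shading_cost(same_material, costs)       # 10
unique_cost = warp_shading_cost(unique_materials, costs)  # 320

print(same_cost, unique_cost)
```

With these numbers the fully divergent warp costs 32 times as much as the coherent one, which is the worst case for a 32-lane warp.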
Yes, there are studies on this topic, and yes, there are things you can do to improve situations that have significant divergence. For traversal, using the RTX hardware traversal and intersection for all of the scene geometry is a major way to both improve performance and cut down divergence. Avoiding any-hit programs and custom intersection programs when possible is another way to improve performance and potentially reduce divergence. (It’s not always possible to avoid any-hit programs or custom intersectors, and I don’t want to stigmatize the use of valid, necessary features, but any-hit and intersection programs do interrupt hardware traversal to execute your code on the SMs, so there’s significant overhead.)
For materials, some people try to use an “ubershader” - a single shader program that can handle many kinds of material properties. Others use a “wavefront” architecture that splits tracing work and shading work into separate passes, where the shading work can be sorted and scheduled in batches by material. A wavefront architecture also allows you to consolidate ray tracing work at every step of path depth, so the kernels get smaller as you go, and warps stay compacted with active work on all threads.
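The compaction and sorting step of a wavefront renderer can be sketched on the CPU like this. It is a Python illustration, not how any particular renderer implements it: the `Hit` record, the use of `-1` for a terminated ray, and the tiny warp size of 4 are all assumptions made to keep the example readable.

```python
from dataclasses import dataclass

WARP_SIZE = 4  # assumption: tiny warp so the example is easy to follow


@dataclass
class Hit:
    ray_id: int
    material_id: int  # hypothetical convention: -1 means the ray missed or terminated


def compact_and_sort(hits):
    """Drop dead rays, then group the survivors by material."""
    alive = [h for h in hits if h.material_id >= 0]
    return sorted(alive, key=lambda h: h.material_id)


def serialized_shader_runs(hits, warp_size=WARP_SIZE):
    """Count distinct (warp, material) pairs; each pair is one serialized shader run."""
    total = 0
    for start in range(0, len(hits), warp_size):
        warp = hits[start:start + warp_size]
        total += len({h.material_id for h in warp if h.material_id >= 0})
    return total


# Hypothetical hit results for 12 rays after one bounce.
materials = [0, 1, 2, -1, 1, 0, -1, 2, 2, 1, 0, -1]
hits = [Hit(i, m) for i, m in enumerate(materials)]

before = serialized_shader_runs(hits)  # 9: every warp sees three materials
queue = compact_and_sort(hits)         # 9 live rays, ordered by material
after = serialized_shader_runs(queue)  # 5: warps are mostly single-material

print(before, after, len(queue))
```

The shading pass would then launch over `queue` rather than the original ray order, and the next bounce starts from a list that only holds live rays, which is what keeps the later kernels small.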