In general it is beneficial for BVH traversal performance to have convergent ray loads.
The crucial question is how much traversal time you actually save with sorted rays; that saving is the maximum time you can spend on sorting just to break even.
That time became a lot shorter with the RT cores on Turing, because the hardware BVH traversal is very fast. Divergence when shading different surface hits is also a factor.
Still, divergent rays will access more memory. On the other hand, manually sorting rays adds extra work and memory traffic between launches.
This means all of it is GPU- and scene-dependent and would need to be measured individually.
As a start, you could simply partition rays into eight buckets by the ray direction octant, defined by the signs of the direction components.
The single ray programming model in OptiX is meant to allow internal scheduling to be changed.
What you describe is effectively the wavefront tracing used by the OptiX Prime API, which was discontinued with OptiX 7. (There is an optixRaycasting SDK example instead.)
At a maximum of 10 GRays/s on high-end Turing GPUs, reading and writing the ray queries and hit records alone would run into memory bandwidth limits (you could only transfer around 64 bytes per ray at that rate), and that is before doing any shading work.
It should be more efficient to keep the RT cores busy across multiple ray segments using the built-in continuation mechanisms provided by the OptiX program domains.