That needs some more details to answer:
First of all, about how many rays and cubes are we talking?
What are the optixLaunch dimensions?
That image shows something like 50 cubes and 15 rays?
That would absolutely not saturate any GPU supported by OptiX 7, and your benchmarks would measure more of the launch overhead than the actual ray tracing performance.
So what exactly are you measuring and how?
Mind that all OptiX API calls taking a CUDA stream argument, especially optixAccelBuild and optixLaunch, are asynchronous kernel launches, which require stream synchronization around the host call when you want to benchmark the time they take.
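For example, a minimal timing sketch using CUDA events (pipeline, sbt, d_params, LaunchParams, stream, width, and height are placeholders for your own application objects; error checking omitted):

```cpp
// Sketch: timing an asynchronous optixLaunch with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);                     // enqueue start marker on the stream
optixLaunch(pipeline, stream, d_params, sizeof(LaunchParams),
            &sbt, width, height, /* depth = */ 1);  // asynchronous!
cudaEventRecord(stop, stream);                      // enqueue stop marker on the stream

cudaEventSynchronize(stop);  // block the host until the launch has actually finished

float milliseconds = 0.0f;
cudaEventElapsedTime(&milliseconds, start, stop);   // device time between the markers

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

The same applies to optixAccelBuild; without the synchronization you would only measure the host-side cost of enqueueing the work.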
Then what is the system configuration you’re measuring this on?
That’s required to judge whether the absolute performance data (e.g. in millions of rays traced per second) is reasonable:
OS version, installed GPU(s), VRAM amount, display driver version, OptiX version (major.minor.micro), CUDA toolkit version (major.minor) used to generate the input PTX, host compiler version.
Then since you mentioned world sizes, distances between rays, and ray lengths, what are these values in absolute units?
“We found that if we reduce the number of cubes to make them sparse, increasing the distance between rays can improve the performance of the optixLaunch method.”
If you reduce the number of primitives inside the scene, the acceleration structure gets smaller.
Smaller acceleration structures are simpler to traverse. Also, more rays might then miss all primitives, which would also be faster.
If the primitives’ AABBs inside the sparser acceleration structure overlap less, that would also speed up the scene traversal.
“However, if we increase the overall size of the space to make the cubes sparse, increasing the distance between rays does not improve performance, but increasing the length of the rays does.”
If you scale the positions of the cubes and keep the cubes the same size, did you also scale the ray grid to result in exactly the same number of closest intersections?
Then the performance should be similar.
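Just to make explicit what I mean by scaling both consistently, a hedged sketch (Cube, Ray, cubes, rays, and scale are hypothetical names, assuming a uniform scene scale factor):

```cpp
// Sketch: scale the cube positions AND the ray grid by the same uniform
// factor while keeping the cube size and the ray count fixed. The goal is
// the same number of closest intersections as before the scaling.
const float scale = 4.0f; // hypothetical uniform scale factor

for (Cube& cube : cubes)
    cube.center *= scale;  // cube.size stays unchanged

for (Ray& ray : rays)
{
    ray.origin *= scale;   // keep the ray grid aligned with the scaled cube positions
    ray.tmax   *= scale;   // stretch the ray length by the same factor
}
```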
If changing the ray length results in more AABBs needing to be tested during the traversal, then that would affect performance. When placing the cubes farther apart while not increasing their size, fewer intersections might be required along the longer rays than in the denser placement of cubes. So increasing the ray length should actually make it slower, not faster.
Are you gathering all hits along the ray or only the closest?
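For reference, the ray length enters the traversal as the tmax argument of optixTrace inside the ray generation program; a minimal device-side sketch (params.handle, rayOrigin, rayDirection, and rayLength are placeholder names):

```cpp
// Device-side sketch: the "ray length" is the [tmin, tmax] interval limit
// of optixTrace. A larger tmax lets the traversal visit more BVH nodes
// and test more AABBs along the ray.
unsigned int p0 = 0; // illustrative payload register

optixTrace(params.handle,   // traversable handle from the launch parameters
           rayOrigin,
           rayDirection,
           0.0f,            // tmin
           rayLength,       // tmax: this is your "ray length"
           0.0f,            // rayTime (motion blur)
           OptixVisibilityMask(255),
           OPTIX_RAY_FLAG_NONE,
           0,               // SBT offset
           1,               // SBT stride
           0,               // miss SBT index
           p0);
```

Gathering all hits along a ray is usually implemented in an anyhit program which records each hit and calls optixIgnoreIntersection() to continue the traversal; that is considerably more work per ray than only reporting the closest hit.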
Could you elaborate what you’re trying to achieve and why?