The size of an OptiX launch and computing resources

In order to reduce workload imbalance, I split the work of a single ray with a heavy workload across multiple rays, which increases the size of the OptiX launch (still below the 2^30 limit). But once I increase the launch size past a certain point, the overall performance no longer improves. Is this because the computing resources are not sufficient to support that many threads? If I also combine the work of lightly loaded rays into a single ray, further reducing the workload imbalance, will the overall performance improve?
Thank you!

Please understand that it’s not possible to answer questions like this without knowing

  • what the underlying hardware is,
  • how many rays are traced,
  • what the algorithm is doing exactly,
  • what the absolute performance numbers are in your experiments.

Since I do not know how you determine which launch indices have a “heavy workload” I cannot precisely explain how to balance that for your use case.

If you’re concerned about the occupancy inside your kernel launches, then yes, there are methods, for example inside an iterative path tracer, to keep the work per launch index more balanced.
For example, depending on how you determine that some launch indices do less work than others (some counter, the clock), you could let shorter-running launch indices do additional work and track that somehow.
That will obviously not decrease the per-launch runtime, but it could help reduce the overall runtime when the algorithm needs many launches.
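
As a rough illustration, here is a minimal raygen sketch of that idea for an iterative path tracer. The LaunchParams layout, the tracePath() helper, minSamples and clockBudget are all made-up names for this example, not anything from your code:

```cpp
#include <optix.h>

// All of the following (LaunchParams layout, tracePath, minSamples, clockBudget)
// are assumptions for illustration only.
struct LaunchParams
{
    float4*                accumBuffer;  // accumulated radiance per pixel
    unsigned int*          sampleCount;  // samples actually taken per pixel
    unsigned int           width;
    unsigned int           minSamples;   // every launch index does at least this many
    long long              clockBudget;  // rough per-index time budget in clock ticks
    OptixTraversableHandle handle;
};

extern "C" __constant__ LaunchParams params;

// Assumed helper that runs the optixTrace() iteration for one path and
// returns its radiance contribution. Defined elsewhere in the application.
__device__ float3 tracePath(unsigned int pixel, unsigned int sampleIndex);

extern "C" __global__ void __raygen__adaptive()
{
    const uint3        idx   = optixGetLaunchIndex();
    const unsigned int pixel = idx.y * params.width + idx.x;

    const long long start = clock64();

    float3       radiance = {0.0f, 0.0f, 0.0f};
    unsigned int samples  = 0;

    // Mandatory work for this launch index.
    while (samples < params.minSamples)
    {
        const float3 c = tracePath(pixel, samples++);
        radiance.x += c.x; radiance.y += c.y; radiance.z += c.z;
    }

    // Shorter-running indices keep taking samples until the time budget is spent.
    while (clock64() - start < params.clockBudget)
    {
        const float3 c = tracePath(pixel, samples++);
        radiance.x += c.x; radiance.y += c.y; radiance.z += c.z;
    }

    params.accumBuffer[pixel] = {radiance.x, radiance.y, radiance.z, 1.0f};
    params.sampleCount[pixel] = samples; // track how much extra work each index did
}
```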

You could also implement a worker scheme where the launch dimension is smaller and each launch index fetches more work items from a work queue.
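
For illustration, a minimal sketch of such a worker scheme, where the WorkItem struct, the processItem() helper and the launch parameter fields are all hypothetical. Each launch index keeps pulling the next item off the queue with an atomic until the queue is empty, so fast and slow items average out across the threads:

```cpp
#include <optix.h>

// Hypothetical work item describing a sub-range of one heavy ray's workload.
struct WorkItem
{
    unsigned int pixel;        // which pixel/ray this item belongs to
    unsigned int firstSample;
    unsigned int numSamples;
};

struct LaunchParams
{
    WorkItem*              queue;     // all work items, filled on the host
    unsigned int           numItems;
    unsigned int*          nextItem;  // single global counter, initialized to 0
    OptixTraversableHandle handle;
};

extern "C" __constant__ LaunchParams params;

// Assumed helper that traces the rays for one work item. Defined elsewhere.
__device__ void processItem(const WorkItem& item);

extern "C" __global__ void __raygen__workQueue()
{
    // The launch dimension can be much smaller than the number of work items.
    // Each launch index simply pulls the next item until the queue is exhausted.
    while (true)
    {
        const unsigned int i = atomicAdd(params.nextItem, 1u);
        if (i >= params.numItems)
            return;

        processItem(params.queue[i]);
    }
}
```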

You could also launch many smaller kernels asynchronously. The per-launch overhead is small: it’s basically only the time to copy the launch parameter block into constant memory plus the kernel launch itself, both of which take only microseconds.
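
A rough host-side sketch of that approach, assuming the image is split into horizontal slices, an assumed rowOffset field in the launch parameters, and a pipeline and SBT created elsewhere. The key point is one stream and one device-side parameter block per in-flight launch, so the copies and launches can be issued back to back:

```cpp
#include <optix.h>
#include <cuda_runtime.h>

#include <algorithm>
#include <vector>

// Minimal stand-in for the real launch parameters; rowOffset is an assumed field
// which the raygen program would add to optixGetLaunchIndex().y.
struct LaunchParams
{
    unsigned int rowOffset;
    // ... whatever else the application needs ...
};

void launchInSlices(OptixPipeline pipeline, const OptixShaderBindingTable& sbt,
                    const LaunchParams& base, unsigned int width, unsigned int height,
                    unsigned int numSlices)
{
    std::vector<cudaStream_t> streams(numSlices);
    std::vector<CUdeviceptr>  dParams(numSlices);
    std::vector<LaunchParams> hostParams(numSlices, base);

    const unsigned int sliceHeight = (height + numSlices - 1) / numSlices;

    for (unsigned int s = 0; s < numSlices; ++s)
    {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(reinterpret_cast<void**>(&dParams[s]), sizeof(LaunchParams));

        const unsigned int rowStart = std::min(s * sliceHeight, height);
        const unsigned int rows     = std::min(sliceHeight, height - rowStart);
        if (rows == 0)
            continue; // can happen when numSlices does not divide height evenly

        // One parameter block per in-flight launch; do not reuse the same device
        // buffer while an earlier asynchronous launch may still be reading it.
        hostParams[s].rowOffset = rowStart;
        cudaMemcpyAsync(reinterpret_cast<void*>(dParams[s]), &hostParams[s],
                        sizeof(LaunchParams), cudaMemcpyHostToDevice, streams[s]);

        optixLaunch(pipeline, streams[s], dParams[s], sizeof(LaunchParams),
                    &sbt, width, rows, 1);
    }

    // Wait for all slices; Nsight Systems will show whether they overlap.
    for (unsigned int s = 0; s < numSlices; ++s)
    {
        cudaStreamSynchronize(streams[s]);
        cudaFree(reinterpret_cast<void*>(dParams[s]));
        cudaStreamDestroy(streams[s]);
    }
}
```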

The problem is always to balance the per-launch workload against the optixLaunch time.
Also note that there are system configurations where kernel launches should be kept below an OS timeout threshold (for example, the Windows WDDM timeout detection and recovery (TDR) on GPUs which also drive the display). What happens otherwise depends on the OS and the underlying hardware.

In any case, Nsight Systems and Nsight Compute profiling could be used to analyze what happens and where the bottlenecks are.
Please read this thread if Nsight Compute is not showing your device code:
https://forums.developer.nvidia.com/t/debugging-is-broken-after-updating-to-cuda-12-1/245616