The size of an OptiX launch and computing resources

In order to reduce workload imbalance, I split the work of a single ray with a heavy workload across multiple rays, which increases the size of the OptiX launch (still below the 2^30 limit). But once I increase the launch size past a certain point, the overall performance no longer improves. Is this because the computing resources are not sufficient to support that many threads? If I also combine the work of lightly loaded rays into a single ray, further reducing the workload imbalance, will the overall performance improve?
Thank you!

Please understand that it’s not possible to answer questions like this without knowing

  • what the underlying hardware is,
  • how many rays are traced,
  • what the algorithm is doing exactly,
  • what the absolute performance numbers are in your experiments.

Since I do not know how you determine which launch indices have a “heavy workload” I cannot precisely explain how to balance that for your use case.

If you’re concerned about the occupancy inside your kernel launches, then yes, there are methods, for example inside an iterative path tracer, to keep the work per launch index more balanced.
For example, depending on how you determine that some launch indices do less work than others (some counter, the clock), you could let shorter-running launch indices do additional work and track that somehow.
That will obviously not decrease the per-launch runtime, but it could help reduce the overall runtime when the algorithm needs many launches.
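
As a rough illustration, here is a minimal raygen sketch of that idea for an iterative path tracer. The LaunchParams layout, the tracePath() helper, minSamples and clockBudget are all made-up names for this example, not anything from your code:

```cpp
#include <optix.h>

// All of the following (LaunchParams layout, tracePath, minSamples, clockBudget)
// are assumptions for illustration only.
struct LaunchParams
{
    float4*                accumBuffer;  // accumulated radiance per pixel
    unsigned int*          sampleCount;  // samples actually taken per pixel
    unsigned int           width;
    unsigned int           minSamples;   // every launch index does at least this many
    long long              clockBudget;  // rough per-index time budget in clock ticks
    OptixTraversableHandle handle;
};

extern "C" __constant__ LaunchParams params;

// Assumed helper that runs the optixTrace() iteration for one path and
// returns its radiance contribution. Defined elsewhere in the application.
__device__ float3 tracePath(unsigned int pixel, unsigned int sampleIndex);

extern "C" __global__ void __raygen__adaptive()
{
    const uint3        idx   = optixGetLaunchIndex();
    const unsigned int pixel = idx.y * params.width + idx.x;

    const long long start = clock64();

    float3       radiance = {0.0f, 0.0f, 0.0f};
    unsigned int samples  = 0;

    // Mandatory work for this launch index.
    while (samples < params.minSamples)
    {
        const float3 c = tracePath(pixel, samples++);
        radiance.x += c.x; radiance.y += c.y; radiance.z += c.z;
    }

    // Shorter-running indices keep taking samples until the time budget is spent.
    while (clock64() - start < params.clockBudget)
    {
        const float3 c = tracePath(pixel, samples++);
        radiance.x += c.x; radiance.y += c.y; radiance.z += c.z;
    }

    params.accumBuffer[pixel] = {radiance.x, radiance.y, radiance.z, 1.0f};
    params.sampleCount[pixel] = samples; // track how much extra work each index did
}
```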

You could also implement a worker scheme where the launch dimension is smaller and each launch index fetches more work items from a work queue.
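
For illustration, a minimal sketch of such a worker scheme, where the WorkItem struct, the processItem() helper and the launch parameter fields are all hypothetical. Each launch index keeps pulling the next item off the queue with an atomic until the queue is empty, so fast and slow items average out across the threads:

```cpp
#include <optix.h>

// Hypothetical work item describing a sub-range of one heavy ray's workload.
struct WorkItem
{
    unsigned int pixel;        // which pixel/ray this item belongs to
    unsigned int firstSample;
    unsigned int numSamples;
};

struct LaunchParams
{
    WorkItem*              queue;     // all work items, filled on the host
    unsigned int           numItems;
    unsigned int*          nextItem;  // single global counter, initialized to 0
    OptixTraversableHandle handle;
};

extern "C" __constant__ LaunchParams params;

// Assumed helper that traces the rays for one work item. Defined elsewhere.
__device__ void processItem(const WorkItem& item);

extern "C" __global__ void __raygen__workQueue()
{
    // The launch dimension can be much smaller than the number of work items.
    // Each launch index simply pulls the next item until the queue is exhausted.
    while (true)
    {
        const unsigned int i = atomicAdd(params.nextItem, 1u);
        if (i >= params.numItems)
            return;

        processItem(params.queue[i]);
    }
}
```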

You could also launch many smaller kernels asynchronously. The per-launch overhead is small: it’s basically only the time to copy the launch parameter block into constant memory plus the kernel launch itself, both of which take only microseconds.
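
A rough host-side sketch of that approach, assuming the image is split into horizontal slices, an assumed rowOffset field in the launch parameters, and a pipeline and SBT created elsewhere. The key point is one stream and one device-side parameter block per in-flight launch, so the copies and launches can be issued back to back:

```cpp
#include <optix.h>
#include <cuda_runtime.h>

#include <algorithm>
#include <vector>

// Minimal stand-in for the real launch parameters; rowOffset is an assumed field
// which the raygen program would add to optixGetLaunchIndex().y.
struct LaunchParams
{
    unsigned int rowOffset;
    // ... whatever else the application needs ...
};

void launchInSlices(OptixPipeline pipeline, const OptixShaderBindingTable& sbt,
                    const LaunchParams& base, unsigned int width, unsigned int height,
                    unsigned int numSlices)
{
    std::vector<cudaStream_t> streams(numSlices);
    std::vector<CUdeviceptr>  dParams(numSlices);
    std::vector<LaunchParams> hostParams(numSlices, base);

    const unsigned int sliceHeight = (height + numSlices - 1) / numSlices;

    for (unsigned int s = 0; s < numSlices; ++s)
    {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(reinterpret_cast<void**>(&dParams[s]), sizeof(LaunchParams));

        const unsigned int rowStart = std::min(s * sliceHeight, height);
        const unsigned int rows     = std::min(sliceHeight, height - rowStart);
        if (rows == 0)
            continue; // can happen when numSlices does not divide height evenly

        // One parameter block per in-flight launch; do not reuse the same device
        // buffer while an earlier asynchronous launch may still be reading it.
        hostParams[s].rowOffset = rowStart;
        cudaMemcpyAsync(reinterpret_cast<void*>(dParams[s]), &hostParams[s],
                        sizeof(LaunchParams), cudaMemcpyHostToDevice, streams[s]);

        optixLaunch(pipeline, streams[s], dParams[s], sizeof(LaunchParams),
                    &sbt, width, rows, 1);
    }

    // Wait for all slices; Nsight Systems will show whether they overlap.
    for (unsigned int s = 0; s < numSlices; ++s)
    {
        cudaStreamSynchronize(streams[s]);
        cudaFree(reinterpret_cast<void*>(dParams[s]));
        cudaStreamDestroy(streams[s]);
    }
}
```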

The problem is always to balance the per-launch workload against the optixLaunch time.
Also note that there are system configurations where kernel launches should be kept below an OS timeout threshold (for example, the Windows WDDM timeout detection and recovery (TDR) on GPUs which also drive the display). What happens otherwise depends on the OS and the underlying hardware.

In any case, Nsight Systems and Nsight Compute profiling could be used to analyze what happens and where the bottlenecks are.
Please read this thread if Nsight Compute is not showing your device code:
https://forums.developer.nvidia.com/t/debugging-is-broken-after-updating-to-cuda-12-1/245616