Let’s change the terminology a little and start from the bottom.
Can I build multiple BVHs?
An acceleration structure (AS) is represented by a traversable handle (an opaque 64-bit value) and a CUDA device pointer to the AS data (usually a BVH).
You can build and hold as many AS as you want, limited only by the available VRAM of your system configuration.
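Just as an illustration (the AccelRecord struct is my own naming for this example, not OptiX API), holding multiple AS could look like this:

```cpp
#include <cuda.h>
#include <optix.h>
#include <vector>

// One record per acceleration structure: the opaque traversable handle
// plus the device memory that backs the BVH data. The buffer must stay
// alive for as long as the handle is used for traversal.
struct AccelRecord
{
    OptixTraversableHandle handle   = 0;
    CUdeviceptr            d_buffer = 0;
};

// Keep as many as VRAM allows; each entry would be filled by your own
// optixAccelBuild code.
std::vector<AccelRecord> g_accels;
```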
then launch rays to traverse these BVHs in parallel?
OptiX uses a single-ray programming model: each optixTrace device call shoots one ray and takes a traversable handle argument which is the starting point (“world”, “scene”, “root”, “top-level object”, you name it) for the ray’s traversal through that specific AS.
So if you store multiple traversable handles from different AS in your launch parameter block (or in some buffer of traversable handles whose pointer and size you store in your launch parameters), your ray generation program can select a different AS for different rays in a single launch.
Since you’re responsible for the ray generation program implementation you can manage your input and output data as you like.
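For example, a minimal ray generation sketch along those lines could look like this (the Params layout and the program name are made up for this example):

```cpp
#include <optix.h>

// Hypothetical launch parameter block as described above: a device
// pointer to an array of traversable handles plus its element count.
struct Params
{
    OptixTraversableHandle* handles;    // device array of N handles
    unsigned int            numHandles;
    float4*                 output;     // one result slot per launch index
};

extern "C" __constant__ Params params;

extern "C" __global__ void __raygen__select_as()
{
    const uint3 idx = optixGetLaunchIndex();
    const uint3 dim = optixGetLaunchDimensions();
    const unsigned int linear = idx.y * dim.x + idx.x;

    // Pick a different AS per ray; the selection rule is entirely up to you.
    const OptixTraversableHandle handle = params.handles[linear % params.numHandles];

    const float3 origin    = make_float3(0.0f, 0.0f, 0.0f);
    const float3 direction = make_float3(0.0f, 0.0f, 1.0f);

    unsigned int p0 = 0; // payload register written by closest-hit/miss programs

    optixTrace(handle, origin, direction,
               0.0f, 1e16f, 0.0f,            // tmin, tmax, ray time
               OptixVisibilityMask(255),
               OPTIX_RAY_FLAG_NONE,
               0, 1, 0,                      // SBT offset, SBT stride, miss index
               p0);

    // Store whatever the payload carries back; here just the raw bits.
    params.output[linear] = make_float4(__uint_as_float(p0), 0.0f, 0.0f, 1.0f);
}
```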
The optixLaunch dimension defines how many invocations of the ray generation program are done (one per launch index). That launch dimension is limited to 2^30 in OptiX (see the Limits chapter inside the programming guide). As many of these launch indices (threads) as possible are processed in parallel automatically, depending on the underlying GPU device.
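The host-side call then provides those dimensions, e.g. (a minimal sketch; pipeline, stream, parameter block, and SBT set up elsewhere, error checking omitted):

```cpp
#include <cuda.h>
#include <optix.h>
#include <optix_stubs.h>

// width * height * depth ray generation invocations are made, one per
// launch index; the product must not exceed 2^30.
void launch(OptixPipeline pipeline, CUstream stream,
            CUdeviceptr d_params, size_t paramsSize,
            const OptixShaderBindingTable& sbt,
            unsigned int width, unsigned int height)
{
    optixLaunch(pipeline, stream, d_params, paramsSize, &sbt,
                width, height, /*depth=*/1);
}
```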
I mean, optixLaunch is called by several host threads.
Yes, it’s also possible to run multiple optixLaunch calls concurrently and that doesn’t even need multiple host threads.
That requires separate CUDA streams (preferably not the default CUDA stream zero, which can have different synchronization behavior depending on how the CUDA context has been created; read the CUDA driver API documentation about that), different launch parameter blocks, and unfortunately different OptiX pipelines at this time. See the yellow warning box in this chapter of the OptiX Programming Guide:
https://raytracing-docs.nvidia.com/optix7/guide/index.html#ray_generation_launches#ray-generation-launches
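A hypothetical sketch of that setup with two pipelines on two non-default streams (reusing the Params struct idea from above; error checking omitted):

```cpp
#include <cuda_runtime.h>
#include <optix.h>
#include <optix_stubs.h>

// Run two launches concurrently: separate streams, separate pipelines,
// and separate device-side launch parameter blocks, as the programming
// guide currently requires.
void launchConcurrently(OptixPipeline pipelines[2],
                        OptixShaderBindingTable sbts[2],
                        CUdeviceptr d_params[2], size_t paramsSize,
                        unsigned int width, unsigned int height)
{
    cudaStream_t streams[2];
    for (int i = 0; i < 2; ++i)
        cudaStreamCreate(&streams[i]);

    // Both calls return immediately; the launches can overlap on the device.
    for (int i = 0; i < 2; ++i)
        optixLaunch(pipelines[i], streams[i], d_params[i], paramsSize,
                    &sbts[i], width, height, 1);

    for (int i = 0; i < 2; ++i)
    {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}
```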
If this operation can be done, what is its performance compared to traversing the BVHs serially?
Whether that helps at all depends on the GPU workload you’re generating with each optixLaunch.
If your launch dimensions are reasonably sized to saturate your installed GPU, then I don’t expect any benefit from multiple parallel optixLaunch calls and I would never do that. That will only help when your launch sizes are too small.
Mind that in the end the work needs to be done by the same hardware units and I don’t know how much overhead switching between these different streams and kernels incurs.
All OptiX calls taking a CUDA stream argument are asynchronous, meaning you can also send many optixLaunch calls into the same CUDA stream and they will be processed as quickly as possible in the order they have been submitted.
That would only need one OptiX pipeline and, to make it fully asynchronous, some careful asynchronous memory copies of the different launch parameters from host to device in between. If you prepare these launch parameter blocks for all launches upfront in separate host memory locations, then there should be no issues with asynchronous cudaMemcpyAsync/cuMemcpyAsync calls.
That would obviously also require different result output buffers per optixLaunch if this is to work on different AS, unless this is some progressive accumulation algorithm on the result data.
You would only need to synchronize once at the end to wait for all asynchronous optixLaunch calls to be finished.
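Putting that together, a hypothetical sketch with one pipeline and a single stream (Params again being the shared host/device launch parameter struct from the sketch above; error checking omitted):

```cpp
#include <cuda_runtime.h>
#include <optix.h>
#include <optix_stubs.h>
#include <vector>

// Submit N launches into one stream. hostParams holds one pre-filled block
// per launch (different AS handle, different output buffer), each in its
// own persistent host memory location so the async copies stay valid.
void launchSequence(OptixPipeline pipeline,
                    const OptixShaderBindingTable& sbt,
                    cudaStream_t stream,
                    CUdeviceptr d_params, // one reusable device-side block
                    const std::vector<Params>& hostParams,
                    unsigned int width, unsigned int height)
{
    for (const Params& p : hostParams)
    {
        // Stream ordering guarantees the previous launch has finished reading
        // d_params before this copy overwrites it, so one device block suffices.
        cudaMemcpyAsync(reinterpret_cast<void*>(d_params), &p, sizeof(Params),
                        cudaMemcpyHostToDevice, stream);
        optixLaunch(pipeline, stream, d_params, sizeof(Params), &sbt,
                    width, height, 1);
    }
    // Synchronize once at the very end; all launches have finished after this.
    cudaStreamSynchronize(stream);
}
```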
Using one traversable handle for each optixLaunch might actually be beneficial with future OptiX versions when using the Shader Execution Reordering feature.
In summary, I would not recommend using multiple optixLaunch calls running in parallel on separate streams if their workload is reasonably sized.