Accessing multiple BVHs in parallel

Can I build multiple BVHs and then launch rays to traverse these BVHs in parallel? I mean, optixLaunch is called by several host threads.

If this operation can be done, what is its performance compared to traversing the BVHs serially?

Thanks in advance.

Let’s change the terminology a little and start from the bottom.

Can I build multiple BVHs

An acceleration structure (AS) is represented by a traversable handle (an opaque 64-bit value) and a CUDA device pointer to the AS data (usually a BVH).
You can build and hold as many AS as you want, only limited by the available VRAM of your system configuration.

then launch rays to traverse these BVHs in parallel

OptiX uses a single ray programming model and each optixTrace device call shoots one ray and takes a traversable handle argument which is the starting point (“world”, “scene”, “root”, “top-level object”, you name it) for the ray’s traversal through that specific AS.

So if you store multiple traversable handles from different AS in your launch parameter block (or in a buffer of traversable handles whose pointer and size you store in your launch parameters), your ray generation program can select a different AS for different rays within a single launch.
Since you’re responsible for the ray generation program implementation, you can manage your input and output data as you like.
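A minimal device-side sketch of that idea follows. The `LaunchParams` struct and the round-robin selection are illustrative assumptions, not part of the OptiX SDK; only `optixTrace`, `optixGetLaunchIndex`, and `optixGetLaunchDimensions` are actual OptiX device functions:

```cpp
#include <optix.h>

// Hypothetical launch parameter block; field names are illustrative.
struct LaunchParams
{
    OptixTraversableHandle* handles;    // device buffer of traversable handles
    unsigned int            numHandles;
    float4*                 output;     // one result slot per launch index
};

extern "C" __constant__ LaunchParams params;

extern "C" __global__ void __raygen__select_as()
{
    const uint3 idx = optixGetLaunchIndex();
    const uint3 dim = optixGetLaunchDimensions();
    const unsigned int linearIndex = idx.y * dim.x + idx.x;

    // Pick a different acceleration structure per ray, e.g. round-robin.
    const OptixTraversableHandle handle = params.handles[linearIndex % params.numHandles];

    const float3 origin    = make_float3(0.0f, 0.0f, -1.0f); // placeholder ray
    const float3 direction = make_float3(0.0f, 0.0f,  1.0f);

    unsigned int p0 = 0; // payload register

    optixTrace(handle, origin, direction,
               0.0f, 1e16f, 0.0f,        // tmin, tmax, ray time
               OptixVisibilityMask(255),
               OPTIX_RAY_FLAG_NONE,
               0, 1, 0,                  // SBT offset, stride, miss index
               p0);

    // Write a result derived from the payload (omitted).
}
```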

The optixLaunch dimension defines how many invocations of the ray generation program are executed (one per launch index). That launch dimension is limited to 2^30 in OptiX (see the Limits chapter of the programming guide). As many of these launch indices (threads) as possible are processed in parallel automatically, depending on the underlying GPU device.

I mean, optixLaunch is called by several host threads.

Yes, it’s also possible to run multiple optixLaunch calls concurrently, and that doesn’t even need multiple host threads.
It requires separate CUDA streams (preferably not the default CUDA stream zero, which might have different synchronization behavior depending on how the CUDA context has been created; see the CUDA Driver API documentation on that), different launch parameter blocks, and unfortunately, at this time, different OptiX pipelines. See the yellow warning box in that chapter of the OptiX Programming Guide.
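A host-side sketch of two concurrent launches on separate streams might look like the following. It assumes the pipelines, shader binding tables, and device-side launch parameter blocks (`pipelineA/B`, `sbtA/B`, `d_paramsA/B`) have already been created; those names are placeholders:

```cpp
// Two non-blocking streams so the launches can overlap.
CUstream streamA, streamB;
cuStreamCreate(&streamA, CU_STREAM_NON_BLOCKING);
cuStreamCreate(&streamB, CU_STREAM_NON_BLOCKING);

// Each concurrent launch needs its own pipeline and its own
// launch parameter block in device memory.
optixLaunch(pipelineA, streamA, d_paramsA, sizeof(LaunchParams), &sbtA, width, height, 1);
optixLaunch(pipelineB, streamB, d_paramsB, sizeof(LaunchParams), &sbtB, width, height, 1);

// Wait for both launches to finish.
cuStreamSynchronize(streamA);
cuStreamSynchronize(streamB);
```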

If this operation can be done, what is its performance compared to traversing the BVHs serially?

Whether that helps at all depends on the GPU workload you’re generating with each optixLaunch.
If your launch dimensions are large enough to saturate the installed GPU, then I don’t expect any benefit from multiple parallel optixLaunch calls and would not do that. It will only help when your launch sizes are too small.

Mind that in the end the work needs to be done by the same hardware units and I don’t know how much overhead switching between these different streams and kernels incurs.

All OptiX calls taking a CUDA stream argument are asynchronous. Meaning you can also send many optixLaunch calls into the same CUDA stream and they will be processed as quickly as possible in the order they have been submitted.
That would only need one OptiX pipeline and, to make it fully asynchronous, some careful asynchronous memory copies of the different launch parameters from host to device in between. If you prepare these launch parameter blocks for all launches upfront in separate host memory locations, there should be no issues with asynchronous cudaMemcpyAsync/cuMemcpyAsync calls.
That would obviously also require a different result output buffer per optixLaunch if this is meant to work on different AS, unless it is some progressive accumulation algorithm on the result data.
You would only need to synchronize once at the end to wait for all asynchronous optixLaunch calls to be finished.
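A sketch of that single-stream pattern, under the assumption that `h_params[i]` are persistent per-launch host copies of the launch parameters (so each asynchronous copy reads stable memory) and `traversableHandles`, `d_outputs`, `pipeline`, and `sbt` exist as described above:

```cpp
// N launches into one stream, each traversing a different AS.
// d_params is the single device-side launch parameter block the pipeline reads.
for (unsigned int i = 0; i < N; ++i)
{
    h_params[i].handle = traversableHandles[i]; // different AS per launch
    h_params[i].output = d_outputs[i];          // different output buffer per launch

    // Async copy is safe because h_params[i] stays valid until the stream is done.
    cuMemcpyHtoDAsync(d_params, &h_params[i], sizeof(LaunchParams), stream);
    optixLaunch(pipeline, stream, d_params, sizeof(LaunchParams), &sbt, width, height, 1);
}
// One synchronization at the end waits for all submitted launches.
cuStreamSynchronize(stream);
```

Because the copies and launches go into the same stream, each launch is guaranteed to see its own parameter block before the next copy overwrites `d_params`.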

Using one traversable handle for each optixLaunch might actually be beneficial in future OptiX versions when using the Shader Execution Reordering feature.

In summary, I would not recommend using multiple optixLaunch calls running in parallel on separate streams if their workload is reasonably sized.


Thanks for your detailed explanation!
