Lack of support for threadfence in OptiX IR

Thanks, I updated the bug report with your system configuration information.

I really don’t know if what you’re doing is a viable solution. That’s why I asked why you’re doing it, so I can raise the question with the OptiX core developers internally.

If you want all native CUDA programming features available in your ray tracing application, another approach would be to implement a wavefront ray casting renderer: OptiX is only used for the ray generation and the ray-primitive intersection, and all shading calculations are done between optixLaunch calls with native CUDA kernels on the same stream.
Since these kernel launches are asynchronous to the CPU, they are processed in order on that CUDA stream as fast as possible.
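
To make that more concrete, below is a minimal, hypothetical sketch of such a wavefront loop. All names in it (Ray, Hit, shadeKernel, renderWavefront, the buffer layout) are assumptions of mine and not SDK code; only optixLaunch() and the CUDA kernel launch syntax are real API. It only illustrates how an OptiX launch and a native CUDA shading kernel are interleaved on one stream.

```cpp
// Minimal, hypothetical wavefront sketch (not OptiX SDK code).
// The OptiX launch only traces rays and writes hit records, a native CUDA
// kernel does all shading, and both run on the same stream, so they execute
// in submission order without CPU synchronization in between.
#include <optix.h>
#include <cuda_runtime.h>
#include <utility>

struct Ray { float3 origin; float tmin; float3 direction; float tmax; };
struct Hit { float t; unsigned int primIndex; float2 barycentrics; int didHit; };

__global__ void shadeKernel(const Ray* rays, const Hit* hits,
                            Ray* nextRays, float4* accum, unsigned int count)
{
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count)
        return;
    // Shading and next-bounce ray generation would go here. Because this is a
    // native CUDA kernel, the full CUDA feature set (e.g. __threadfence()) is available.
}

void renderWavefront(OptixPipeline pipeline, const OptixShaderBindingTable& sbt,
                     CUstream stream, CUdeviceptr d_params, size_t paramsSize,
                     Ray* d_rays, Hit* d_hits, Ray* d_nextRays, float4* d_accum,
                     unsigned int numRays, int maxBounces)
{
    const unsigned int block = 256;
    const unsigned int grid  = (numRays + block - 1) / block;

    for (int bounce = 0; bounce < maxBounces; ++bounce)
    {
        // 1. OptiX launch: the raygen program reads d_rays (via the launch
        //    parameters) and writes intersection results to d_hits.
        optixLaunch(pipeline, stream, d_params, paramsSize, &sbt, numRays, 1, 1);

        // 2. Native CUDA kernel: shade the hits and emit the next wave of rays.
        shadeKernel<<<grid, block, 0, stream>>>(d_rays, d_hits, d_nextRays, d_accum, numRays);

        std::swap(d_rays, d_nextRays);
    }

    cudaStreamSynchronize(stream); // synchronize once per frame, not per launch
}
```

Because all shading happens in ordinary CUDA kernels, everything CUDA offers, including __threadfence(), is available there.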

The drawback of that approach is the memory bandwidth needed for the ray input and hit/miss output buffers, and the need to implement a proper processing pipeline where the work is chunked into GPU-saturating pieces.
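
As a rough illustration of the chunking part, here is another hypothetical sketch that splits a large frame into fixed-size pieces and reuses the renderWavefront() sketch above; the chunk size is just an assumed starting point for tuning, not a recommendation.

```cpp
// Hypothetical chunking sketch: process the frame in fixed-size pieces so the
// ray and hit buffers stay bounded while each piece is still large enough to
// keep the GPU saturated. Reuses the renderWavefront() sketch from above.
void renderChunked(OptixPipeline pipeline, const OptixShaderBindingTable& sbt,
                   CUstream stream, CUdeviceptr d_params, size_t paramsSize,
                   Ray* d_rays, Hit* d_hits, Ray* d_nextRays, float4* d_accum,
                   unsigned int totalRays, int maxBounces)
{
    const unsigned int raysPerChunk = 1u << 20; // ~1M rays per chunk (assumed tuning value)

    for (unsigned int first = 0; first < totalRays; first += raysPerChunk)
    {
        const unsigned int remaining = totalRays - first;
        const unsigned int count     = (remaining < raysPerChunk) ? remaining : raysPerChunk;

        // In a real renderer, a chunk offset inside the launch parameters would
        // tell the raygen program which pixels/rays this chunk covers.
        renderWavefront(pipeline, sbt, stream, d_params, paramsSize,
                        d_rays, d_hits, d_nextRays, d_accum, count, maxBounces);
    }
}
```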

There are multiple professional renderers which use this approach.

There is a very simple example of that approach inside the OptiX SDK: the optixRaycasting example. However, it lacks any iteration and chunking of work; it just shoots primary rays, does some shading with the normal vector on a model, and saves the result as an image.

Related posts:
https://forums.developer.nvidia.com/t/branch-divergence/176258
https://forums.developer.nvidia.com/t/task-scheduling-in-optix-7/167050