Lack of support for __threadfence() in OptiX IR

I’m working on a task that involves communication between an OptiX shader and an ordinary CUDA kernel running concurrently on the same device. My implementation of a work queue uses __threadfence() to synchronize some memory accesses. It works fine when I compile my modules via PTX, but when I enable OptiX IR I get the following error:

COMPILER (2) [0]: COMPILE ERROR: Malformed input. See compile details for more information.
Error: Function llvm.nvvm.membar.gl is not supported. Called from: __threadfence (No source location available. The input PTX may not contain debug information (nvcc option: -lineinfo), OptixModuleCompileOptions::debugLevel set to OPTIX_COMPILE_DEBUG_LEVEL_NONE, or no useful information is present for the current block.)

I understand that warp-level coordination primitives are disallowed, but are other synchronization primitives like __threadfence() also unsupported? Am I doing something incorrect by using it?
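For reference, the producer side of my queue looks roughly like this (heavily simplified sketch; the names and the queue layout are placeholders, not my actual code):

#define QUEUE_CAPACITY 4096

struct TaskPayload { float3 input; /* ... */ };

struct WorkQueue
{
    int*          tickets;   // per-slot ready flags (0 = free, 1 = published)
    TaskPayload*  payloads;  // task data written by the shader
    unsigned int* tail;      // next free slot, advanced atomically
};

// Called from OptiX device code, e.g. a closest-hit program.
__device__ void enqueueTask(WorkQueue* q, const TaskPayload& task)
{
    // Reserve a slot; atomics between launch indices are allowed.
    unsigned int slot = atomicAdd(q->tail, 1u) % QUEUE_CAPACITY;
    q->payloads[slot] = task;

    // Make the payload visible device-wide before publishing the slot.
    // This is the call that triggers the llvm.nvvm.membar.gl error above.
    __threadfence();

    atomicExch(&q->tickets[slot], 1); // publish: the worker may now read it
}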

Thanks for the report.

Could you please add your system configuration information?
OS version, installed GPU(s), VRAM amount, display driver version, OptiX (major.minor.micro) version, CUDA toolkit version (major.minor) used to generate the input PTX, and host compiler version.

In case you’re not on the latest available display driver for your system, could you check whether updating the display driver changes the behavior?

That the compiler does not support that instruction inside the intermediate LLVM representation on the OptiX IR code path sounds like a bug which would need to be fixed inside a future display driver, especially since the same code works with PTX.

On the other hand, the OptiX Programming Guide has this to say about memory fences:
“The memory model is consistent only within the execution of a single launch index, which starts at the ray-generation invocation and only with subsequent programs reached from any optixTrace or callable program. This includes writes to stack allocated variables. Writes from other launch indices may not be available until after the launch is complete. If needed, atomic operations may be used to share data between launch indices, as long as an ordering between launch indices is not required. Memory fences are not supported.”
https://raytracing-docs.nvidia.com/optix8/guide/index.html#program_pipeline_creation#programming-model

I would like to discuss this internally to see whether that also affects your use case or is only meant to apply between threads inside the OptiX device code.

Could you explain some more what kind of “communication between an OptiX shader and an ordinary CUDA kernel running concurrently on the same device” you’re doing and what problem that solves during an OptiX kernel launch?

I would be generally wary of all CUDA synchronization commands inside OptiX device code. It might not be a good idea to halt a thread until outstanding memory writes have been committed, and even less so after OptiX 8.0.0 added support for Shader Execution Reordering, which will reassign work to threads as it sees fit.
OptiX is designed around a single-ray execution model, and which thread handles which OptiX launch index is intentionally abstracted away from the developer.

Thank you for the quick response.
I am running Ubuntu 22.04.3 LTS x86_64, a GeForce RTX 4080 16GB, driver 535.113.01, CUDA 12.2, OptiX 8.0.0, GCC 11.3.0. I think this is the latest driver version.

It is unfortunate to see that memory fences are not supported. The motivation for the project is that, since many warp-wide optimizations and shared memory are not available in OptiX, a shader could instead send a task to a persistent worker CUDA kernel, which would perform the computation more efficiently and communicate the result back to the shader. Targeting tensor cores is a major application here, since those instructions are explicitly unsupported in OptiX’s compilation.

In theory, such communication is entirely independent of the launch index. Because threads are communicating with an external kernel instead of another OptiX thread, there is no reason why they need to be identified by their launch index externally. Thus, there are no problematic dependencies between threads.
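To illustrate, the consumer is a persistent CUDA kernel on its own stream, reusing the WorkQueue sketch from my first post (again simplified, with placeholder names and a stand-in for the real computation):

// Launched once, concurrently with optixLaunch, on a separate stream.
// Work is identified purely by queue slot, never by OptiX launch index.
__global__ void workerKernel(WorkQueue* q, volatile int* shutdown)
{
    unsigned int slot = blockIdx.x * blockDim.x + threadIdx.x;
    while (!*shutdown)
    {
        // Wait until the producer publishes this slot.
        if (atomicAdd(&q->tickets[slot], 0) == 1)
        {
            q->payloads[slot].input.x *= 2.0f; // stand-in for the real work,
                                               // e.g. a tensor-core operation
            __threadfence();                   // make the result visible
            atomicExch(&q->tickets[slot], 2);  // signal completion to the shader
        }
        slot = (slot + gridDim.x * blockDim.x) % QUEUE_CAPACITY;
    }
}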

Do you think memory fences might be viable in such a case?

Thanks, I updated the bug report with your system configuration information.

I really don’t know if what you’re doing is a viable solution. That’s why I wanted to know why you’re doing it, so I can ask the OptiX core developers internally.

If you want to have all native CUDA programming methods available in your ray tracing application, another approach would be to implement a wavefront ray casting renderer where OptiX is used only for the ray generation and the ray-primitive intersection, and all shading calculations are done between optixLaunch calls with native CUDA kernels inside the same stream.
Since these kernel launches are asynchronous to the CPU, they would be processed in order on that CUDA stream as fast as possible.
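In rough pseudocode, the host side of such a wavefront loop could look like this (a sketch only; the kernels, the LaunchParams layout, and the buffer handling are placeholder assumptions, not code from the SDK):

#include <optix.h>
#include <cuda_runtime.h>

struct LaunchParams { /* ray buffer, hit/miss buffer, traversable handle, ... */ };

// Placeholder user kernels; both would read and write the buffers
// referenced through the launch parameters.
__global__ void generateRaysKernel(unsigned int numRays);
__global__ void shadeAndScatterKernel(unsigned int numRays);

void renderFrame(OptixPipeline pipeline, const OptixShaderBindingTable* sbt,
                 CUdeviceptr dParams, // device copy of LaunchParams
                 CUstream stream, unsigned int numRays, int maxBounces)
{
    dim3 block(256);
    dim3 grid((numRays + block.x - 1) / block.x);

    generateRaysKernel<<<grid, block, 0, stream>>>(numRays);

    for (int bounce = 0; bounce < maxBounces; ++bounce)
    {
        // OptiX only casts the current wave of rays and writes hit/miss
        // records; no shading happens inside the OptiX launch.
        optixLaunch(pipeline, stream, dParams, sizeof(LaunchParams),
                    sbt, numRays, 1, 1);

        // Native CUDA kernel between the launches: shared memory, warp
        // intrinsics, tensor cores and __threadfence() are all available.
        shadeAndScatterKernel<<<grid, block, 0, stream>>>(numRays);
    }
    // Everything is asynchronous to the CPU and executes in order on the
    // stream, so no host-side synchronization is needed between bounces.
}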

The drawback of that approach is the memory bandwidth required for the ray input and hit/miss output buffers, and the need to implement a proper processing pipeline where the work is chunked into GPU-saturating pieces.

There are multiple professional renderers which use this approach.

There is a very simple example of that approach inside the OptiX SDK, the optixRaycasting example, but it lacks any iteration and chunking of work. It just shoots primary rays, does some shading with the normal vector on a model, and saves that as an image.

Related posts:
https://forums.developer.nvidia.com/t/branch-divergence/176258
https://forums.developer.nvidia.com/t/task-scheduling-in-optix-7/167050

The analysis of the issue revealed the following:

OptiX does not support __threadfence() at this time.

That OptiX IR is throwing an error during validation is the expected behavior: the __threadfence() generated an internal instruction which is disallowed in later compilation steps.

That the PTX input seemed to work is because during compilation all __threadfence() (i.e., PTX membar) instructions were removed, so they never reached the validation step that produces the error on the OptiX IR path. This is going to be fixed by reporting the same error for PTX input in the future.

In other words, the resulting kernel was not doing what you expected in the PTX case either.

If you really need all native CUDA features during your ray generation or shading computations, the wavefront renderer architecture described above would allow that today.