[ 4][ DISKCACHE]: Cache miss for key: ptx-3297480-key824b85d820cecbce03d34376f92ed2f5-sm_86-rtc1-drv571.96
[ 2][ COMPILER]: COMPILE ERROR: Malformed input. See compile details for more information.
Error: Mask argument to llvm.nvvm.match.any.sync.i32 is not the result of a call to __activemask: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.8/include/crt/sm_70_rt.hpp:94:3
I have chased the problem down and reduced the code to the following, inside a __forceinline__ function in a CH program.
unsigned am = __activemask();
unsigned value = 5;
unsigned res = __match_any_sync(am, value);
If I remove the third line, the compilation succeeds. What could be the problem?
If you look into OptiX_Programming_Guide_8.0.0.pdf, section 6.2 on page 63, you see: […] shared memory usage and warp-wide or block-wide synchronization — such as barriers — are not allowed in the input PTX code. […]
[…] Special warp-wide instructions like vote and ballot are allowed, […] warp-wide instructions can be used safely when the algorithm in question is independent of locality. […]
__match_any_sync() is not (necessarily) warp-wide synchronizing when used with __activemask(). It was my understanding that opportunistic warp-level programming is OK in OptiX. It would be great to get clarification on that.
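For context, this is the kind of pattern I mean by opportunistic warp-level programming: a rough sketch of the classic warp-aggregated atomic in plain CUDA (the helper name atomicAggInc is mine, and I am assuming an sm_70+ target). Whether something like this is safe inside an OptiX program is exactly what I would like clarified.

__device__ int atomicAggInc(int* ctr)
{
    unsigned int active = __activemask();                                     // whichever threads happen to be here
    unsigned int peers  = __match_any_sync(active, (unsigned long long)ctr);  // group threads by destination pointer
    int leader  = __ffs(peers) - 1;                                           // lowest participating lane leads
    int lane    = threadIdx.x & 31;
    int rank    = __popc(peers & ((1u << lane) - 1u));                        // my position within the group
    int warpRes = 0;
    if (lane == leader)
        warpRes = atomicAdd(ctr, __popc(peers));                              // one atomic per group
    warpRes = __shfl_sync(peers, warpRes, leader);                            // broadcast the base index
    return warpRes + rank;
}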
Warp intrinsics that compile should be okay to use, at least in raygen, with the caveats that you can only use the active mask and you can't change the active mask before using it. For example, a call to optixTrace() or optixReorder(), or calling functions or callables, can invalidate the active mask. Higher-level intrinsics (block-sync, etc.) are not supported in OptiX.
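As a rough sketch of what that means in practice, something along these lines should be fine; the Params struct, trace parameters, and key value below are placeholders of my own, the only point being that the mask is re-queried after optixTrace() instead of being reused:

#include <optix.h>

struct Params { OptixTraversableHandle handle; };  // placeholder launch params
extern "C" __constant__ Params params;

extern "C" __global__ void __raygen__rg()
{
    unsigned int key = optixGetLaunchIndex().x & 3u;

    // Before the trace: the mask comes straight from __activemask().
    unsigned int am = __activemask();
    unsigned int p0 = __match_any_sync(am, key);

    float3 origin    = { 0.0f, 0.0f, 0.0f };
    float3 direction = { 0.0f, 0.0f, 1.0f };
    optixTrace(params.handle, origin, direction,
               0.0f, 1e16f, 0.0f,
               OptixVisibilityMask(255), OPTIX_RAY_FLAG_NONE,
               0, 1, 0, p0);

    // After the trace the old mask may be stale, so query it again
    // before the next warp-wide intrinsic.
    am = __activemask();
    p0 = __match_any_sync(am, key);  // safe again with the fresh mask
}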
What SM version are you targeting? I can't immediately repro the compile error you're seeing using an OptiX SDK sample, but I had to update the cmake CUDA_MIN_SM_TARGET and SAMPLES_NVCC_FLAG variables to use sm_70. I did use CUDA 12.8, so that's ruled out, but I'm on Windows 11 with a higher driver version and OptiX version.
Replacing -G with -lineinfo removes the problem, and the code behaves as intended as far as I can tell. -G is of limited use for OptiX PTX code anyway; I am not sure what the difference is from the perspective of the validation code.
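For reference, the swap looks roughly like this on the NVRTC side (the rtc1 in the cache key suggests runtime compilation; the function, source argument, and remaining options are placeholders of mine, only the -G vs -lineinfo choice matters):

#include <nvrtc.h>

// chSource is assumed to hold the CUDA source of the hit program.
void compileChProgram(const char* chSource, nvrtcProgram* prog)
{
    nvrtcCreateProgram(prog, chSource, "ch.cu", 0, nullptr, nullptr);
    const char* options[] = {
        "--gpu-architecture=compute_86",
        "--relocatable-device-code=true",
        "-lineinfo",   // was "-G"; keeps line tables without full device debug
    };
    nvrtcCompileProgram(*prog, 3, options);
}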
and you can’t change the active mask before using it
This is not the case for the problem at hand, but is it ok to carefully manipulate the active mask?
unsigned am = __activemask();
unsigned value = 5;
if (threadIdx.x % 32 == 0) return;
am = am & ~1; // mask out thread 0
unsigned res = __match_any_sync(am, value);
Aha, thanks for the compiler flags; I needed the debug flag to repro. This does happen for me, so I've filed a bug report. Unfortunately it'll take a while for the fix, once completed, to percolate through our branching, QA, and release process. Will you be able to manage without this in the meantime?
is it ok to carefully manipulate the active mask?
Unfortunately, no. The compiler needs to be able to tell that you used __activemask(), and it doesn't know that the mask matches the thread-id check in your return statement. Luckily, it's easy to call __activemask() again instead.
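Applied to your earlier example, that suggestion looks something like this (the helper name is just for illustration): let the divergence drop lane 0 and query the mask again afterwards instead of editing the old one.

__forceinline__ __device__ unsigned int matchWithoutLane0(unsigned int value)
{
    // Lane 0 simply doesn't participate; leaving via return removes it
    // from the active mask naturally.
    if ((threadIdx.x & 31u) == 0u)
        return 0u;

    // Query the mask again *after* the divergence, so the argument is
    // still the unmodified result of __activemask().
    unsigned int am = __activemask();
    return __match_any_sync(am, value);
}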
Thank you for looking into it and reproducing. I can work around this problem for now.
Unfortunately, no. The compiler needs to be able to tell that you used __activemask(), and it doesn't know that the mask matches the thread-id check in your return statement. Luckily, it's easy to call __activemask() again instead.
Final question, as this relates to the problem the validator is trying to catch: is the don't-touch-the-active-mask requirement an OptiX constraint, or is it a general CUDA model requirement? I am concerned that, although it may not be a problem in practice, the Volta execution model allows even two consecutive calls to __activemask() to return different values due to incidental divergence.
Oh, this requirement that the mask must be the unmodified result of a call to __activemask() is entirely an OptiX constraint. In CUDA, you can do anything you want with the mask. ;) This is related to what @m001 pointed out: OptiX is a single-threaded programming model, and that means the parallel execution is, as a rule, to be considered an implementation detail. We're trying to allow limited warp communication, but yeah, it's a bit limited compared to CUDA.