[ 4][ DISKCACHE]: Cache miss for key: ptx-3297480-key824b85d820cecbce03d34376f92ed2f5-sm_86-rtc1-drv571.96
[ 2][ COMPILER]: COMPILE ERROR: Malformed input. See compile details for more information.
Error: Mask argument to llvm.nvvm.match.any.sync.i32 is not the result of a call to __activemask: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.8/include/crt/sm_70_rt.hpp:94:3
I have chased the problem down and reduced the code to the following, inside a __forceinline__ function in a CH program.
unsigned am = __activemask();
unsigned value = 5;
unsigned res = __match_any_sync(am, value);
If I remove the third line, the compilation succeeds. What could be the problem?
If you look into OptiX_Programming_Guide_8.0.0.pdf, section 6.2 on page 63, you see: […] shared memory usage and warp-wide or block-wide synchronization — such as barriers — are not allowed in the input PTX code. […]
[…] Special warp-wide instructions like vote and ballot are allowed, […] warp-wide instructions can be used safely when the algorithm in question is independent of locality. […]
__match_any_sync() is not (necessarily) warp-wide synchronizing when used with __activemask(). It was my understanding that opportunistic warp-level programming is OK in OptiX. It would be great to get clarification on that.
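For context, this is the kind of pattern I mean by opportunistic warp-level programming: a rough sketch of the classic warp-aggregated atomic in plain CUDA (the helper name atomicAggInc is mine, and I am assuming an sm_70+ target). Whether something like this is safe inside an OptiX program is exactly what I would like clarified.

__device__ int atomicAggInc(int* ctr)
{
    unsigned int active = __activemask();                                     // whichever threads happen to be here
    unsigned int peers  = __match_any_sync(active, (unsigned long long)ctr);  // group threads by destination pointer
    int leader  = __ffs(peers) - 1;                                           // lowest participating lane leads
    int lane    = threadIdx.x & 31;
    int rank    = __popc(peers & ((1u << lane) - 1u));                        // my position within the group
    int warpRes = 0;
    if (lane == leader)
        warpRes = atomicAdd(ctr, __popc(peers));                              // one atomic per group
    warpRes = __shfl_sync(peers, warpRes, leader);                            // broadcast the base index
    return warpRes + rank;
}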
Warp intrinsics that compile should be okay to use, at least in raygen, with the caveats that you can only use the active mask and you can't change the active mask before using it. For example, a call to optixTrace() or optixReorder(), or calling functions or callables, can invalidate the active mask. Higher-level intrinsics (block-sync, etc.) are not supported in OptiX.
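As a rough sketch of what that means in practice, something along these lines should be fine; the Params struct, trace parameters, and key value below are placeholders of my own, the only point being that the mask is re-queried after optixTrace() instead of being reused:

#include <optix.h>

struct Params { OptixTraversableHandle handle; };  // placeholder launch params
extern "C" __constant__ Params params;

extern "C" __global__ void __raygen__rg()
{
    unsigned int key = optixGetLaunchIndex().x & 3u;

    // Before the trace: the mask comes straight from __activemask().
    unsigned int am = __activemask();
    unsigned int p0 = __match_any_sync(am, key);

    float3 origin    = { 0.0f, 0.0f, 0.0f };
    float3 direction = { 0.0f, 0.0f, 1.0f };
    optixTrace(params.handle, origin, direction,
               0.0f, 1e16f, 0.0f,
               OptixVisibilityMask(255), OPTIX_RAY_FLAG_NONE,
               0, 1, 0, p0);

    // After the trace the old mask may be stale, so query it again
    // before the next warp-wide intrinsic.
    am = __activemask();
    p0 = __match_any_sync(am, key);  // safe again with the fresh mask
}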
What SM version are you targeting? I can't immediately repro the compile error you're seeing using an OptiX SDK sample, but I had to update the cmake CUDA_MIN_SM_TARGET and SAMPLES_NVCC_FLAG variables to use sm_70. I did use CUDA 12.8, so that's ruled out, but I'm on Windows 11 with a higher driver version and OptiX version.
Replacing -G with -lineinfo removes the problem, and the code behaves as intended as far as I can tell. -G is of limited use for OptiX PTX code anyway; I am not sure what the difference is from the perspective of the validation code.
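For reference, the swap looks roughly like this on the NVRTC side (the rtc1 in the cache key suggests runtime compilation; the function, source argument, and remaining options are placeholders of mine, only the -G vs -lineinfo choice matters):

#include <nvrtc.h>

// chSource is assumed to hold the CUDA source of the hit program.
void compileChProgram(const char* chSource, nvrtcProgram* prog)
{
    nvrtcCreateProgram(prog, chSource, "ch.cu", 0, nullptr, nullptr);
    const char* options[] = {
        "--gpu-architecture=compute_86",
        "--relocatable-device-code=true",
        "-lineinfo",   // was "-G"; keeps line tables without full device debug
    };
    nvrtcCompileProgram(*prog, 3, options);
}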
and you can’t change the active mask before using it
This is not the case for the problem at hand, but is it ok to carefully manipulate the active mask?
unsigned am = __activemask();
unsigned value = 5;
if (threadIdx.x % 32 == 0) return;
am = am & ~1; // mask out thread 0
unsigned res = __match_any_sync(am, value);
Aha, thanks for the compiler flags; I needed the debug flag to repro. This does happen for me, so I've filed a bug report. Unfortunately it'll take a while for the fix, once completed, to percolate through our branching, QA, and release process. Will you be able to manage without this in the meantime?
is it ok to carefully manipulate the active mask?
Unfortunately, no. The compiler needs to be able to tell that you used __activemask(), and it doesn't know that the mask matches the thread-id check in your return statement. Luckily, it's easy to call __activemask() again instead.
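Applied to your earlier example, that suggestion looks something like this (the helper name is just for illustration): let the divergence drop lane 0 and query the mask again afterwards instead of editing the old one.

__forceinline__ __device__ unsigned int matchWithoutLane0(unsigned int value)
{
    // Lane 0 simply doesn't participate; leaving via return removes it
    // from the active mask naturally.
    if ((threadIdx.x & 31u) == 0u)
        return 0u;

    // Query the mask again *after* the divergence, so the argument is
    // still the unmodified result of __activemask().
    unsigned int am = __activemask();
    return __match_any_sync(am, value);
}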
Thank you for looking into it and reproducing. I can work around this problem for now.
Unfortunately, no. The compiler needs to be able to tell that you used __activemask(), and it doesn't know that the mask matches the thread-id check in your return statement. Luckily, it's easy to call __activemask() again instead.
Final question, as this relates to the problem the validator is trying to catch: is the don't-touch-the-active-mask requirement an OptiX constraint, or is it a general CUDA model requirement? I am concerned that, although it may not be a problem in practice, the Volta execution model allows even two consecutive calls to __activemask() to return different values due to incidental divergence.
Oh, this requirement that the mask must be the unmodified result of a call to __activemask() is entirely an OptiX constraint. In CUDA, you can do anything you want with the mask. ;) This is related to what @m001 pointed out: OptiX is a single-threaded programming model, and that means the parallel execution is, as a rule, to be considered an implementation detail. We're trying to allow limited warp communication, but yeah, it's a bit limited compared to CUDA.