How does the mask work in warp primitives?

Your code has undefined behavior. Every thread that calls __shfl_sync must have the bit for its own lane set in the mask it passes.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#id36

The new *_sync shfl intrinsics take in a mask indicating the threads participating in the call. A bit, representing the thread’s lane id, must be set for each participating thread to ensure they are properly converged before the intrinsic is executed by the hardware. Each calling thread must have its own bit set in the mask and all non-exited threads named in mask must execute the same intrinsic with the same mask, or the result is undefined.
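As an illustration of the rule quoted above, here is a minimal sketch of one way to build a valid mask when only some lanes of a warp reach the shuffle. The kernel name shuffle_subset, the parameter n, the source lane 3, and the single-warp launch are assumptions made up for this example and are not taken from your code. The idea is to compute the mask with __ballot_sync while all 32 lanes are still converged, then pass that same mask from inside the branch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: lanes with tid < n broadcast a value from lane 3.
// Assumes a single warp (blockDim.x == 32) and n > 3 so the source lane is active.
__global__ void shuffle_subset(int *out, int n)
{
    int tid = threadIdx.x;

    // All 32 lanes execute this ballot, so the full mask is valid here.
    // The result has bit i set exactly when lane i satisfies tid < n.
    unsigned mask = __ballot_sync(0xffffffffu, tid < n);

    int value = -1;
    if (tid < n) {
        // Only lanes whose bit is set in mask reach this call, and every
        // one of them passes the same mask, as the guide requires.
        value = __shfl_sync(mask, tid * 10, 3);
    }
    out[tid] = value;
}

int main()
{
    const int threads = 32;   // exactly one warp
    const int n = 16;         // lanes 0..15 take part in the shuffle
    int host[threads];
    int *dev = nullptr;

    if (cudaMalloc(&dev, sizeof(host)) != cudaSuccess) return 1;
    shuffle_subset<<<1, threads>>>(dev, n);
    if (cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost) != cudaSuccess) return 1;
    cudaFree(dev);

    // Participating lanes should hold lane 3's value (3 * 10);
    // the remaining lanes keep the -1 placeholder.
    for (int i = 0; i < threads; ++i) {
        int expected = (i < n) ? 30 : -1;
        if (host[i] != expected) {
            printf("mismatch at lane %d: %d\n", i, host[i]);
            return 1;
        }
    }
    return 0;
}
```

Passing 0xffffffff inside the branch instead would name lanes 16..31 in the mask even though they never execute the intrinsic, which is the undefined case the guide describes.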