CUDA Warp primitive behaviour question

Hello all, Now I’m learning the warp level data exchange primitives. I have some question on the mask parameter.

I’m now conducting the intra-warp data exchange with following code, which matches my expected output well:

if ((1 << lane_id) & (uint32_t) 0x5a5a5a5a) {
        *(uint32_t*)(&data[0]) = __shfl_xor_sync(0xffffffff, *(uint32_t*)(&data[0]), 5);

While I want to use the mask parameter to do the same, as:

*(uint32_t*)(&data[0]) = __shfl_xor_sync(0x5a5a5a5a, *(uint32_t*)(&data[0]), 5);

The result indicated that ALL threads in the warp participate the data exchange, while I only want the threads with threadIdx.x % 8 == 1,3,4,6 to participate the data exchange (So I set the mask to be 0x5a5a5a5a (0x5a=0b 0101 1010) which exactly indicate the 1, 3, 4 and 6 thread). How should I use the mask parameter to achieve my expected effect?

The mask doesn’t exclude threads. The mask guarantees convergence to at least the level specified in the mask. But the mask does not exclude threads “not selected” in the mask, nor does it prevent those “not selected” threads from participating in the op.

If you want only certain threads to participate, use a boolean if condition to select those threads, before running the primitive. Make sure your mask is consistent with your conditional selection of threads.

  • Depending on results not supported by the mask is undefined behavior.
  • Make sure that both source and necessary destination lanes are included in the mask as well as appropriately selected by your conditional code.

Thanks for your reply, guess it’s good to use my current solution.