What does mask mean in warp shuffle functions (__shfl_sync)

jijn · November 22, 2018, 8:41am

I am trying to understand the mask parameter in shuffle functions, e.g.

T __shfl_down_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);

My understanding is that only threads indicated by 1 bits in ‘mask’ do the exchange. If mask = 0xffffffff, it means all threads in the warp will do the exchange. If mask = 0x00000000, no threads will do the exchange. However, I didn’t get the expected result. Following is the test code.

#include <stdio.h>
#include "cuda_runtime.h"

__global__ void kernel(double *a){
  double v = a[threadIdx.x];

  unsigned mask = 0x00000000;//0xffffffff;//0x000000ff;
  unsigned int offset = 4;
  v += __shfl_down_sync(mask, v, offset, 8);

  a[threadIdx.x] = v;
}

void main(){
  double *a, *a_d;
  a = (double*)calloc(32,sizeof(double));
  cudaMalloc((void **)&a_d,32*sizeof(double));
  for(int i=0;i<32;i++){ a[i]=i/4; }
  cudaMemcpy(a_d, a, 32*sizeof(double), cudaMemcpyHostToDevice);

  for(int i=0;i<32;i++){ printf("%2.0f ",a[i]); }
  printf("\n");

  kernel<<<1,32>>>(a_d);

  cudaMemcpy(a, a_d, 32*sizeof(double), cudaMemcpyDeviceToHost);
  for(int i=0;i<32;i++){ printf("%2.0f ",a[i]); }
}

I got the same results (all threads do the exchange) no matter which mask is used (mask=0xffffffff, mask=0x00000000, mask=0x000000ff).

I am wondering if it is a bug or my understanding of the mask parameter is wrong? I am using a Tesla V100 and CUDA 10.0.

Thanks.

Robert_Crovella · November 22, 2018, 3:57pm

Your understanding is incorrect. Furthermore you are exploring undefined behavior. Therefore my explanation of behavior below should NOT be considered a proper expectation of functionality. For correctness, you must specify a mask parameter which includes the warp lanes you expect to participate. The behavior of lanes outside the mask is undefined (because, in fact, Volta provides no guarantees of warp convergence, except those the programmer specifically asks for, and therefore a warp lane with a zero bit in the mask implies that the specified lane may or may not participate.)

The mask parameter says the following:

“These are the warp lanes that must participate for correctness.”

The compiler will generate the necessary instructions to reconverge those threads if they are not already converged.

Thereafter the warp shuffle proceeds for the current state of the warp.

There is no other implied behavior. Regardless of the mask, after the reconvergence step, the result of the warp shuffle operation will be the result you would get for whichever threads happen to be participating. A zero bit in the mask argument does not prevent a warp lane from participating, it merely does not guarantee that such a lane will participate if the warp is in a diverged state.

There is an important caveat here. The mask parameter will create a reconvergence of the indicated threads. However it cannot cause reconvergence of threads that your code has made impossible.

For example, this is illegal (will result in undefined behavior for warp 0):

if (threadIdx.x > 3) __shfl_down_sync(0xFFFFFFFF, v, offset, 8);

The relevant section in the programming guide:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-shuffle-functions

states:

The new *_sync shfl intrinsics take in a mask indicating the threads participating in the call. A bit, representing the thread’s lane id, must be set for each participating thread to ensure they are properly converged before the intrinsic is executed by the hardware. All non-exited threads named in mask must execute the same intrinsic with the same mask, or the result is undefined.

jijn · November 22, 2018, 9:01pm

Thanks for your detailed explanation. Now it is very clear :)

Topic		Replies	Views
Question regarding the mask in __shfl_sync operations CUDA Programming and Performance	2	1259	January 16, 2020
How do mask work in warp primitive? CUDA Programming and Performance	5	262	May 31, 2023
confuse about warp-level mask CUDA Programming and Performance	12	3107	November 1, 2018
Shfl_sync question CUDA Programming and Performance	1	543	October 12, 2021
[q] mask in __shfl_sync() CUDA Programming and Performance cuda , kernel	5	60	September 13, 2024
CUDA Warp primitive behaviour question CUDA Programming and Performance	2	264	April 13, 2024
Warp shuffle over a mask CUDA Programming and Performance	0	473	August 20, 2020
__shfl_sync requires __syncwarp for every call? CUDA Programming and Performance	4	4370	February 25, 2018
Is this code with warp shuffle functions safe? CUDA Programming and Performance	3	611	February 3, 2019
Requesting clarification - CUDA WARP level primitives and THREAD divergence CUDA Programming and Performance hw , cuda	3	1086	February 14, 2024

What does mask mean in warp shuffle functions (__shfl_sync)

Related topics