What does mask mean in warp shuffle functions (__shfl_sync)

I am trying to understand the mask parameter in shuffle functions, e.g.

T __shfl_down_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);

My understanding is that only threads indicated by 1 bits in ‘mask’ do the exchange. If mask = 0xffffffff, it means all threads in the warp will do the exchange. If mask = 0x00000000, no threads will do the exchange. However, I didn’t get the expected result. Following is the test code.

#include <stdio.h>
#include "cuda_runtime.h"

__global__ void kernel(double *a){
  double v = a[threadIdx.x];

  unsigned mask = 0x00000000;//0xffffffff;//0x000000ff;
  unsigned int offset = 4;
  v += __shfl_down_sync(mask, v, offset, 8);

  a[threadIdx.x] = v;
}

void main(){
  double *a, *a_d;
  a = (double*)calloc(32,sizeof(double));
  cudaMalloc((void **)&a_d,32*sizeof(double));
  for(int i=0;i<32;i++){ a[i]=i/4; }
  cudaMemcpy(a_d, a, 32*sizeof(double), cudaMemcpyHostToDevice);

  for(int i=0;i<32;i++){ printf("%2.0f ",a[i]); }
  printf("\n");

  kernel<<<1,32>>>(a_d);

  cudaMemcpy(a, a_d, 32*sizeof(double), cudaMemcpyDeviceToHost);
  for(int i=0;i<32;i++){ printf("%2.0f ",a[i]); }
}

I got the same results (all threads do the exchange) no matter which mask is used (mask=0xffffffff, mask=0x00000000, mask=0x000000ff).

I am wondering if it is a bug or my understanding of the mask parameter is wrong? I am using a Tesla V100 and CUDA 10.0.

Thanks.

Your understanding is incorrect. Furthermore you are exploring undefined behavior. Therefore my explanation of behavior below should NOT be considered a proper expectation of functionality. For correctness, you must specify a mask parameter which includes the warp lanes you expect to participate. The behavior of lanes outside the mask is undefined (because, in fact, Volta provides no guarantees of warp convergence, except those the programmer specifically asks for, and therefore a warp lane with a zero bit in the mask implies that the specified lane may or may not participate.)

The mask parameter says the following:

“These are the warp lanes that must participate for correctness.”

The compiler will generate the necessary instructions to reconverge those threads if they are not already converged.

Thereafter the warp shuffle proceeds for the current state of the warp.

There is no other implied behavior. Regardless of the mask, after the reconvergence step, the result of the warp shuffle operation will be the result you would get for whichever threads happen to be participating. A zero bit in the mask argument does not prevent a warp lane from participating, it merely does not guarantee that such a lane will participate if the warp is in a diverged state.

There is an important caveat here. The mask parameter will create a reconvergence of the indicated threads. However it cannot cause reconvergence of threads that your code has made impossible.

For example, this is illegal (will result in undefined behavior for warp 0):

if (threadIdx.x > 3) __shfl_down_sync(0xFFFFFFFF, v, offset, 8);

The relevant section in the programming guide:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-shuffle-functions

states:

The new *_sync shfl intrinsics take in a mask indicating the threads participating in the call. A bit, representing the thread’s lane id, must be set for each participating thread to ensure they are properly converged before the intrinsic is executed by the hardware. All non-exited threads named in mask must execute the same intrinsic with the same mask, or the result is undefined.

2 Likes

Thanks for your detailed explanation. Now it is very clear :)