I am trying to understand the mask parameter of the warp shuffle functions, e.g.
T __shfl_down_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
My understanding is that only the threads whose bits are set to 1 in ‘mask’ take part in the exchange. With mask = 0xffffffff, all threads in the warp do the exchange; with mask = 0x00000000, no threads do. However, I did not get the result I expected. My test code follows.
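#include <stdio.h>
#include <stdlib.h>
#include "cuda_runtime.h"

// Each thread adds the value held by the lane 4 positions higher,
// with the warp split into independent segments of 8 lanes.
__global__ void kernel(double *a){
    double v = a[threadIdx.x];
    unsigned mask = 0x00000000; // also tried 0xffffffff and 0x000000ff
    unsigned int offset = 4;
    v += __shfl_down_sync(mask, v, offset, 8);
    a[threadIdx.x] = v;
}

int main(){
    double *a, *a_d;
    a = (double*)calloc(32,sizeof(double));
    cudaMalloc((void **)&a_d,32*sizeof(double));
    // Integer division: the input is 0 0 0 0 1 1 1 1 ... 7 7 7 7
    for(int i=0;i<32;i++){ a[i]=i/4; }
    cudaMemcpy(a_d, a, 32*sizeof(double), cudaMemcpyHostToDevice);
    // Print the input
    for(int i=0;i<32;i++){ printf("%2.0f ",a[i]); }
    printf("\n");
    kernel<<<1,32>>>(a_d);
    // This blocking copy also waits for the kernel to finish
    cudaMemcpy(a, a_d, 32*sizeof(double), cudaMemcpyDeviceToHost);
    // Print the result
    for(int i=0;i<32;i++){ printf("%2.0f ",a[i]); }
    printf("\n");
    cudaFree(a_d);
    free(a);
    return 0;
}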
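For comparison, here is a minimal host-side sketch in plain C of what the kernel's `v += __shfl_down_sync(mask, v, offset, 8)` line should produce when every lane of the warp takes part. It is only a model under stated assumptions: it follows the documented shuffle-down behavior (a lane reads from lane + delta, and keeps its own value if that source falls outside its width-sized segment) and ignores the mask entirely. The name emulate_shfl_down_add and the WARP_SIZE macro are hypothetical, not part of the CUDA API.

```c
#include <assert.h>

#define WARP_SIZE 32  /* assumption: one full warp of 32 lanes */

/* Hypothetical host-side model of "v += __shfl_down_sync(mask, v, delta, width)"
 * with all 32 lanes participating. The mask is deliberately not modeled. */
void emulate_shfl_down_add(const double *in, double *out,
                           unsigned delta, int width) {
    /* width must be a power of two no larger than the warp size */
    assert(width > 0 && width <= WARP_SIZE && (width & (width - 1)) == 0);
    for (int lane = 0; lane < WARP_SIZE; ++lane) {
        int pos = lane % width;  /* position of this lane inside its segment */
        /* If the source lane would fall outside the segment,
         * the lane gets its own value back. */
        int src = (delta < (unsigned)(width - pos)) ? lane + (int)delta : lane;
        out[lane] = in[lane] + in[src];
    }
}
```

With the input from the test code (a[i] = i/4), delta = 4 and width = 8, this model yields 1 1 1 1 2 2 2 2 5 5 5 5 6 6 6 6 and so on; that is the pattern expected when every lane takes part in the exchange.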
I got the same results (all threads do the exchange) no matter which mask I used (mask=0xffffffff, mask=0x00000000, mask=0x000000ff).
Is this a bug, or is my understanding of the mask parameter wrong? I am using a Tesla V100 and CUDA 10.0.
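To make the three masks concrete, here is a small host-side sketch in plain C of which lanes each one names. It assumes that bit i of the mask corresponds to lane i and that unsigned is 32 bits wide; lane_in_mask and lanes_named are hypothetical helpers, not CUDA API functions.

```c
#include <assert.h>

/* Hypothetical helper: returns 1 if bit `lane` of `mask` is set, else 0.
 * Assumes bit i of the mask corresponds to lane i of the warp. */
int lane_in_mask(unsigned mask, int lane) {
    assert(lane >= 0 && lane < 32);
    return (int)((mask >> lane) & 1u);
}

/* Hypothetical helper: counts how many of the 32 lanes the mask names. */
int lanes_named(unsigned mask) {
    int n = 0;
    for (int lane = 0; lane < 32; ++lane) {
        n += lane_in_mask(mask, lane);
    }
    return n;
}
```

Under that assumption, 0xffffffff names all 32 lanes, 0x000000ff names lanes 0 to 7 only, and 0x00000000 names no lanes at all.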
Thanks.