Test code:
#include <stdio.h>
__global__
void kernel1() // kernel 1 is launched with 7 threads
{
int tid = threadIdx.x;
int val = tid;
val = __shfl_up_sync(0xFFFFFFFFU, val, 1); // all 7 threads execute this line, full mask is used
printf("thread %i 's val is %i\n", tid, val);
}
__global__
void kernel2() // kernel 2 is launched with 7 threads
{
int tid = threadIdx.x;
int val = tid;
if (tid < 5)
val = __shfl_up_sync(0xFFFFFFFFU, val, 1); // thread 5 and 6 does not execute this line
printf("thread %i 's val is %i\n", tid, val);
}
int main()
{
kernel1<<<1,7>>>(); // kernel 1 is launched with 7 threads
cudaDeviceSynchronize();
printf("kernel1 done\n");
kernel2<<<1,7>>>(); // kernel 2 is launched with 7 threads
cudaDeviceSynchronize();
printf("kernel2 done\n");
}
Compiling and running above code in 1080 Ti (Pascal), I get:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
$ nvcc warp.cu
$ ./a.out
thread 0 's val is 0
thread 1 's val is 0
thread 2 's val is 1
thread 3 's val is 2
thread 4 's val is 3
thread 5 's val is 4
thread 6 's val is 5
kernel1 done
thread 0 's val is 0
thread 1 's val is 0
thread 2 's val is 1
thread 3 's val is 2
thread 4 's val is 3
thread 5 's val is 5
thread 6 's val is 6
kernel2 done
$
Compiling and running above code in 2080 Ti (Turing), I get:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
$ nvcc warp.cu
$ ./a.out
thread 0 's val is 0
thread 1 's val is 0
thread 2 's val is 1
thread 3 's val is 2
thread 4 's val is 3
thread 5 's val is 4
thread 6 's val is 5
kernel1 done
thread 5 's val is 5
thread 6 's val is 6
thread 0 's val is 0
thread 1 's val is 0
thread 2 's val is 1
thread 3 's val is 2
thread 4 's val is 3
kernel2 done
$
In the CUDA C Programming guide, it says:
“All non-exited threads named in mask must execute the same intrinsic with the same mask, or the result is undefined.”
https://docs.nvidia.com/cuda/archive/10.0/cuda-c-programming-guide#warp-description
In kernel2, I expected the test to fail in both GPUs because on line 18, some non-exited threads named in mask, namely tid=5 and tid=6, does not execute this line. But it did not fail.
It also says in the CUDA C Programming guide that:
“For Pascal and earlier architectures, all threads in mask must execute the same warp intrinsic instruction in convergence, and the union of all values in mask must be equal to the warp’s active mask.”
https://docs.nvidia.com/cuda/archive/10.0/cuda-c-programming-guide#independent-thread-scheduling-7-x
Therefore, I expected that for 1080 Ti (Pascal), kernel 1 should fail because on line 8, warp’s active mask, which should be 0x0000007F, is not equal to the mask value of 0xFFFFFFFF. But it also did not fail.
So my question is whether the example code I’ve posted is not safe that it only succeeded in this case by luck, or it is a safe code and I’ve misunderstood the programming guide document.