Is this code with warp shuffle functions safe?

Test code:

#include <stdio.h>

__global__
void kernel1()                                               // kernel 1 is launched with 7 threads
{
        int tid = threadIdx.x;
        int val = tid;
        val = __shfl_up_sync(0xFFFFFFFFU, val, 1);           // all 7 threads execute this line, full mask is used
        printf("thread %i 's val is %i\n", tid, val);
}

__global__
void kernel2()                                               // kernel 2 is launched with 7 threads
{
        int tid = threadIdx.x;
        int val = tid;
        if (tid < 5)
                val = __shfl_up_sync(0xFFFFFFFFU, val, 1);   // thread 5 and 6 does not execute this line
        printf("thread %i 's val is %i\n", tid, val);
}

int main()
{
        kernel1<<<1,7>>>();          // kernel 1 is launched with 7 threads
        cudaDeviceSynchronize();
        printf("kernel1 done\n");

        kernel2<<<1,7>>>();          // kernel 2 is launched with 7 threads
        cudaDeviceSynchronize();
        printf("kernel2 done\n");
}

Compiling and running above code in 1080 Ti (Pascal), I get:

nvcc --version</b> nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2018 NVIDIA Corporation Built on Sat_Aug_25_21:08:01_CDT_2018 Cuda compilation tools, release 10.0, V10.0.130 <b> nvcc warp.cu
./a.out</b> thread 0 's val is 0 thread 1 's val is 0 thread 2 's val is 1 thread 3 's val is 2 thread 4 's val is 3 thread 5 's val is 4 thread 6 's val is 5 kernel1 done thread 0 's val is 0 thread 1 's val is 0 thread 2 's val is 1 thread 3 's val is 2 thread 4 's val is 3 thread 5 's val is 5 thread 6 's val is 6 kernel2 done <b>


Compiling and running above code in 2080 Ti (Turing), I get:

nvcc --version</b> nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2018 NVIDIA Corporation Built on Sat_Aug_25_21:08:01_CDT_2018 Cuda compilation tools, release 10.0, V10.0.130 <b> nvcc warp.cu
./a.out</b> thread 0 's val is 0 thread 1 's val is 0 thread 2 's val is 1 thread 3 's val is 2 thread 4 's val is 3 thread 5 's val is 4 thread 6 's val is 5 kernel1 done thread 5 's val is 5 thread 6 's val is 6 thread 0 's val is 0 thread 1 's val is 0 thread 2 's val is 1 thread 3 's val is 2 thread 4 's val is 3 kernel2 done <b>


In the CUDA C Programming guide, it says:

“All non-exited threads named in mask must execute the same intrinsic with the same mask, or the result is undefined.”
https://docs.nvidia.com/cuda/archive/10.0/cuda-c-programming-guide#warp-description

In kernel2, I expected the test to fail in both GPUs because on line 18, some non-exited threads named in mask, namely tid=5 and tid=6, does not execute this line. But it did not fail.

It also says in the CUDA C Programming guide that:

“For Pascal and earlier architectures, all threads in mask must execute the same warp intrinsic instruction in convergence, and the union of all values in mask must be equal to the warp’s active mask.”
https://docs.nvidia.com/cuda/archive/10.0/cuda-c-programming-guide#independent-thread-scheduling-7-x

Therefore, I expected that for 1080 Ti (Pascal), kernel 1 should fail because on line 8, warp’s active mask, which should be 0x0000007F, is not equal to the mask value of 0xFFFFFFFF. But it also did not fail.

So my question is whether the example code I’ve posted is not safe that it only succeeded in this case by luck, or it is a safe code and I’ve misunderstood the programming guide document.

Invoking “undefined behaviour” doesn’t necessarily mean that your computer has to crash at that point. You may also just get a mere (random or less so) number back, as you are experiencing in your example.

“undefined behavior” includes behavior that was expected. whether those expectations were justified or not. Such code “happens to work” – until it doesn’t. As opposed to “works by design and according to specifications”.

Thanks tera and njuffa for your comments.

So, am I right in concluding that:

  1. For pre-Volta architectures, both kernel1 and kernel2 are not safe.
  2. For Volta and later architectures, kernel1 is safe, but kernel2 is not safe.