Warp divergence in independent thread scheduling?

Hello,

I have produced a simple kernel to study warp divergence. I am trying to force the first 16 threads of a warp to do something different from the last 16 threads. The idea is that if this serializes the two groups of 16 threads, the run time should be roughly double that of a kernel with no conditional statement.

__global__ void divergence2(){
    double v = 0.0;
    if (threadIdx.x < 16){
        for (int i = 0; i < 10000; i++)
            for (int j = i; j < i * i; j++)
                v = v + i + i + j / 23.324;
    } else {
        for (int i = 0; i < 10000; i++)
            for (int j = i; j < i * i; j++)
                v = v + i - i - j / 23.324;
    }
}

I am launching this kernel as

        divergence2<<< 1000 , 32 >>>(); 

I am running this on a GTX 1080, an RTX 2080 and an RTX A5000.

I am comparing this with a situation without divergence

__global__ void NOdivergence(){
    double v = 0.0;

    for (int i = 0; i < 10000; i++)
        for (int j = i; j < i * i; j++)
            v = v + i - i - j / 23.324;
}

which I launch using

        NOdivergence<<< 1000 , 32 >>>(); 
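The post does not show how the run times are measured. A minimal timing sketch using CUDA events, assuming the two kernels above, could look like this (the warm-up launch and event handling are my additions, not part of the original post):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumes divergence2 and NOdivergence are defined as in the post.
int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so context creation is not included in the timing.
    divergence2<<<1000, 32>>>();
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    divergence2<<<1000, 32>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float msDiv;
    cudaEventElapsedTime(&msDiv, start, stop);

    cudaEventRecord(start);
    NOdivergence<<<1000, 32>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float msNo;
    cudaEventElapsedTime(&msNo, start, stop);

    printf("divergence2: %.3f ms, NOdivergence: %.3f ms\n", msDiv, msNo);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Averaging over several timed launches, rather than a single one, would reduce the fluctuation the post describes.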

On the GTX 1080, the run times fluctuate: the kernel WITHOUT the if statement is anywhere between 1x and 2x slower. On average it is about 1.23x slower.

On the RTX 2080, the run times fluctuate over the same range. On average the kernel WITHOUT the if statement is about 1.46x slower.

On the RTX A5000, the run times fluctuate more than on the RTX 2080. On average the kernel WITHOUT the if statement is about 1.48x slower.

Questions:

  1. Do both the RTX 2080 and the RTX A5000 have independent thread scheduling (ITS)?
  2. Is Independent Thread Scheduling always active, or is it something we can turn on/off?
  3. If it is active, does this mean that there should be NO warp divergence, and hence that the run times should be the same? Why does the conditional statement make the code run faster, not slower?
  4. Is it possible to create a simple example where warp divergence on a device with ITS leads to a substantial slowdown, say a 2x slowdown?
Answers:

  1. Yes. All GPUs of the Volta family or newer use the Volta thread execution model (Independent Thread Scheduling).
  2. It is always active; you cannot disable it. (On the Volta architecture, some CUDA versions may let you disable it by compiling for an architecture below 7.0, but I would not rely on that, and it would prevent you from compiling for the correct target.)
  3. Warp divergence may still have a cost. ITS does not make divergent branches run concurrently within a warp; diverged paths still execute one at a time, and ITS mainly guarantees independent forward progress (for example, across synchronization points). Note also that `v` in your kernels is never written to memory, so the compiler is free to eliminate much of this work, which makes the timings hard to interpret.
  4. Here is an example.
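The original example was not included in the post. Below is a minimal sketch of my own of the kind of kernel that can show the cost of divergence on an ITS device: the two halves of the warp run deliberately different dependent arithmetic chains (one multiply-based, one divide-based, so the compiler cannot merge the branches), and the result is stored to global memory so the work is not optimized away. The helper names (`burnMul`, `burnDiv`, `timeKernel`) are mine.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Two structurally different dependent chains, so the compiler
// cannot collapse the two branches into one predicated loop.
__device__ double burnMul(double v, int iters) {
    for (int i = 0; i < iters; i++)
        v = v * 1.000001 + 0.5;
    return v;
}

__device__ double burnDiv(double v, int iters) {
    for (int i = 0; i < iters; i++)
        v = v / 1.000001 + 0.25;
    return v;
}

__global__ void divergent(double *out, int iters) {
    double v = threadIdx.x;
    if (threadIdx.x % 32 < 16)   // first half-warp takes one path,
        v = burnMul(v, iters);
    else                          // second half-warp takes the other
        v = burnDiv(v, iters);
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;  // sink defeats dead-code elimination
}

__global__ void uniform(double *out, int iters) {
    double v = threadIdx.x;
    v = burnMul(v, iters);        // every thread takes the same path
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}

static float timeKernel(void (*k)(double *, int), double *out, int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    k<<<1000, 32>>>(out, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    double *out;
    cudaMalloc(&out, 1000 * 32 * sizeof(double));
    timeKernel(divergent, out, 100);  // warm-up
    float tDiv = timeKernel(divergent, out, 1000000);
    float tUni = timeKernel(uniform,  out, 1000000);
    printf("divergent: %.2f ms, uniform: %.2f ms, ratio %.2f\n",
           tDiv, tUni, tDiv / tUni);
    cudaFree(out);
    return 0;
}
```

Because each branch does a full chain of serially dependent operations and the warp must execute the two chains one after the other, the divergent kernel should take close to twice as long as the uniform one, though the exact ratio will depend on the device and how the compiler schedules the two paths.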