Warp divergence in independent thread scheduling?

Hello,

I have produced a simple kernel to study warp divergence. I am trying to force the first 16 threads of a warp to do something different from the last 16 threads. The idea is that if this serializes the two groups of 16 threads, the run time should be roughly double that of a kernel with no conditional statement.

__global__ void divergence2(){
    double v = 0.0;
    if (threadIdx.x < 16){
        for (int i = 0; i < 10000; i++)
            for (int j = i; j < i * i; j++)
                v = v + i + i + j / 23.324;
    } else {
        for (int i = 0; i < 10000; i++)
            for (int j = i; j < i * i; j++)
                v = v + i - i - j / 23.324;
    }
}

I am launching this kernel as

        divergence2<<< 1000 , 32 >>>(); 

I am running this on a GTX 1080, an RTX 2080 and an RTX A5000.

I am comparing this with a situation without divergence

__global__ void NOdivergence(){
    double v = 0.0;

    for (int i = 0; i < 10000; i++)
        for (int j = i; j < i * i; j++)
            v = v + i - i - j / 23.324;
}

which I launch using

        NOdivergence<<< 1000 , 32 >>>(); 
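The post does not show how the run times are measured. A minimal timing sketch using CUDA events, assuming the two kernels above, could look like this (the warm-up launch and event handling are my additions, not part of the original post):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumes divergence2 and NOdivergence are defined as in the post.
int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so context creation is not included in the timing.
    divergence2<<<1000, 32>>>();
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    divergence2<<<1000, 32>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float msDiv;
    cudaEventElapsedTime(&msDiv, start, stop);

    cudaEventRecord(start);
    NOdivergence<<<1000, 32>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float msNo;
    cudaEventElapsedTime(&msNo, start, stop);

    printf("divergence2: %.3f ms, NOdivergence: %.3f ms\n", msDiv, msNo);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Averaging over several timed launches, rather than a single one, would reduce the fluctuation the post describes.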

On the GTX 1080, the run times fluctuate: the kernel WITHOUT the if statement is anywhere between 1x and 2x slower. On average it is about 1.23x slower.

On the RTX 2080, the run times fluctuate over the same range. On average the kernel WITHOUT the if statement is about 1.46x slower.

On the RTX A5000, the run times fluctuate more than on the RTX 2080. On average the kernel WITHOUT the if statement is about 1.48x slower.

Questions:

  1. Do both the RTX 2080 and the RTX A5000 have independent thread scheduling (ITS)?
  2. Is Independent Thread Scheduling always active, or is it something we can turn on/off?
  3. If it is active, does this mean that there should be NO warp divergence, and hence that the run times should be the same? Why does the conditional statement make the code run faster, not slower?
  4. Is it possible to create a simple example where warp divergence on a device with ITS leads to a substantial slowdown, say a 2x slowdown?
Answers:

  1. Yes. All GPUs of the Volta family or newer use the Volta thread execution model (Independent Thread Scheduling).
  2. It is always active; you cannot disable it. (On the Volta architecture, some CUDA versions may let you disable it by compiling for an architecture below 7.0, but I would not rely on that, and it would prevent you from compiling for the correct target.)
  3. Warp divergence may still have a cost. ITS does not make divergent branches run concurrently within a warp; diverged paths still execute one at a time, and ITS mainly guarantees independent forward progress (for example, across synchronization points). Note also that `v` in your kernels is never written to memory, so the compiler is free to eliminate much of this work, which makes the timings hard to interpret.
  4. Here is an example.
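The original example was not included in the post. Below is a minimal sketch of my own of the kind of kernel that can show the cost of divergence on an ITS device: the two halves of the warp run deliberately different dependent arithmetic chains (one multiply-based, one divide-based, so the compiler cannot merge the branches), and the result is stored to global memory so the work is not optimized away. The helper names (`burnMul`, `burnDiv`, `timeKernel`) are mine.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Two structurally different dependent chains, so the compiler
// cannot collapse the two branches into one predicated loop.
__device__ double burnMul(double v, int iters) {
    for (int i = 0; i < iters; i++)
        v = v * 1.000001 + 0.5;
    return v;
}

__device__ double burnDiv(double v, int iters) {
    for (int i = 0; i < iters; i++)
        v = v / 1.000001 + 0.25;
    return v;
}

__global__ void divergent(double *out, int iters) {
    double v = threadIdx.x;
    if (threadIdx.x % 32 < 16)   // first half-warp takes one path,
        v = burnMul(v, iters);
    else                          // second half-warp takes the other
        v = burnDiv(v, iters);
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;  // sink defeats dead-code elimination
}

__global__ void uniform(double *out, int iters) {
    double v = threadIdx.x;
    v = burnMul(v, iters);        // every thread takes the same path
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}

static float timeKernel(void (*k)(double *, int), double *out, int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    k<<<1000, 32>>>(out, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    double *out;
    cudaMalloc(&out, 1000 * 32 * sizeof(double));
    timeKernel(divergent, out, 100);  // warm-up
    float tDiv = timeKernel(divergent, out, 1000000);
    float tUni = timeKernel(uniform,  out, 1000000);
    printf("divergent: %.2f ms, uniform: %.2f ms, ratio %.2f\n",
           tDiv, tUni, tDiv / tUni);
    cudaFree(out);
    return 0;
}
```

Because each branch does a full chain of serially dependent operations and the warp must execute the two chains one after the other, the divergent kernel should take close to twice as long as the uniform one, though the exact ratio will depend on the device and how the compiler schedules the two paths.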