Hello,
I have produced a simple kernel to study warp divergence. I am trying to force the first 16 threads of a warp to do something different from the last 16 threads of the warp. The idea is that if this leads to serialization of the first and second groups of 16 threads, the run time should be roughly double that of a kernel with no conditional statement.
__global__ void divergence2(){
    double v = 0.0;
    if (threadIdx.x < 16){
        // first half of the warp
        for (int i = 0; i < 10000; i++)
            for (int j = i; j < i * i; j++)
                v = v + i + i + j / 23.324;
    } else {
        // second half of the warp
        for (int i = 0; i < 10000; i++)
            for (int j = i; j < i * i; j++)
                v = v + i - i - j / 23.324;
    }
}
I am launching this kernel as
divergence2<<<1000, 32>>>();
I am running this on a GTX 1080, an RTX 2080 and an RTX A5000.
I am comparing this with a kernel without divergence:
__global__ void NOdivergence(){
    double v = 0.0;
    for (int i = 0; i < 10000; i++)
        for (int j = i; j < i * i; j++)
            v = v + i - i - j / 23.324;
}
which I launch using
NOdivergence<<<1000, 32>>>();
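For reference, a minimal way to time one of these launches with CUDA events looks like the following (a sketch; the main function, the warm-up launch and the printf formatting are illustrative, not necessarily my exact harness). It is compiled together with the kernels above.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void divergence2();   // defined above

int main(){
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    divergence2<<<1000, 32>>>();   // warm-up launch to absorb one-time overheads
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    divergence2<<<1000, 32>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);    // wait until the timed launch has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("divergence2: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}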
On the GTX 1080, the run times fluctuate between the kernel WITHOUT the if statement being 2x slower and being just as fast (1x). On average, the kernel WITHOUT the if statement is about 1.23x slower.
On the RTX 2080, the run times fluctuate over the same 1x to 2x range. On average, the kernel WITHOUT the if statement is about 1.46x slower.
On the RTX A5000, the run times fluctuate more than on the RTX 2080. On average, the kernel WITHOUT the if statement is about 1.48x slower.
Questions:
- Do both the RTX 2080 and the RTX A5000 have Independent Thread Scheduling (ITS)?
- Is ITS always active, or is it something we can turn on/off?
- If it is on, does this mean that there should be NO warp divergence, and hence that the run times should be the same? Why do I find that the conditional statement makes the code run faster, not slower?
- Is it possible to create a simple example where warp divergence on a device with ITS leads to a substantial slowdown, say a 2x slowdown? (I sketch one attempt below.)
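For that last question, the kind of example I have in mind is sketched below (untested; the kernel name diverge2x, the out buffer and the iters parameter are placeholders of mine). Both half-warps do the same amount of dependent floating-point work, just in different branches, and each thread stores its result so the compiler cannot remove the loops as dead code. If the two branches are fully serialized, I would expect this to run roughly 2x slower than the same loop without the if.

__global__ void diverge2x(double *out, int iters){
    double v = threadIdx.x;
    if (threadIdx.x < 16){
        // first half-warp: a dependent floating-point chain
        for (int i = 0; i < iters; i++)
            v = v * 1.0000001 + 1.0;
    } else {
        // second half-warp: a different but equally long dependent chain
        for (int i = 0; i < iters; i++)
            v = v * 0.9999999 + 2.0;
    }
    // Store the result so the loops cannot be optimized away as dead code.
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}

This would be launched e.g. as diverge2x<<<1000, 32>>>(d_out, 1000000); with d_out a device buffer of 1000 * 32 doubles, and compared against a version that runs only one of the two loops for all 32 threads.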