Hello Everyone,

I want to understand how the thread divergence is caused in an if-else construct ?

Thread divergence is caused if the decision condition is based on thread Index ?
Thread divergence is caused if the decision condition is based on block Index ?
or, both of above.

The kernel below has an if-else construct depending on block Index.

The kernel (which is not optimized) has been launched with following kernel parameters:

``````dim3 block( 1, 256);
dim3 grid( 2048, 8 );

dimx =2048;dimy=2048;
kernel_A<<<grid,block>>>( d_data, dimx,dimy, niterations );
``````

global void kernel_A( float g_data, int dimx, int dimy, int niterations )
{
int ix = blockIdx.x;
int iy = blockIdx.y
int idx = iy*dimx + ix;

``````float value = g_data[idx];

if( ix % 2 )
{
for(int i=0; i<niterations; i++)
{
value += sqrtf( logf(value) + 1.f );
}
}
else
{
for(int i=0; i<niterations; i++)
{
value += sqrtf( cosf(value) + 1.f );
}
}

g_data[idx] = value;
``````

}

Thanks and Regards

A test based on blockIdx won’t cause warp divergence because it will be constant for every thread in any given running warp of threads.

Here is a complete example of warp divergence. This uses Ocelot to execute the program rather than CUDA (there may be differences with how it is implemented in NVIDIA GPUs, but at least this is how it is understood to happen in the academic world)

Take the following very simple example:

``````int a = 2;

{

int b = a + 1;

{

}

}
``````

Assume that this is compiled with no optimization, and evaluated in a straightforward manner. You might get a binary that resembles this:

``````.version 1.4

.entry warpTest()

{

.reg .u32 %r<10>;

.reg .u32 %tid;

.reg .pred %pred;

Entry:

mov.u32 %r0, 2;

mov.u32 %r2, 1;

cvt.u32.u16 %tid, %tid.x;

setp.ge.u32 %pred, %tid, %r0;

@%pred bra Exit;

setp.ge.u32 %pred, %tid, %r2;

@%pred bra Exit;

Exit:

exit;

}
``````

Now assume that you ran this example with one warp, with only four threads active.

This produces the following execution trace which indicates how each instruction is executed.

``````Running kernel warpTest

 - mov.u32 %r0, 2 - 

 - mov.u32 %r1, 1 - 

 - cvt.u32.u16 %r2, %tid.x - 

 - setp.ge.u32 %p3, %r2, %r0 - 

 - @%p3 bra \$BB_1_3 - 

taken: 

fall-through: 

reconverge: 

 - add.u32 %r4, %r0, 1 - 

 - setp.ge.u32 %p3, %r2, %r1 - 

 - @%p3 bra \$BB_1_3 - 

taken: 

fall-through: 

reconverge: 

 - reconverge - 

 - reconverge - 

 - reconverge - 

 - reconverge - 

 - exit - 

Finished kernel
``````

In this example, the predicate mask [xxxx] indicates which threads execute the instruction. A mask of  indicates that the first two threads execute the instruction, but not the second two. In this example, the first two threads skip the if-then-else completely, while the second two go into the ‘then’ path. From there, only the last thread falls into the ‘then’ path of the second if-then-else. The if-then-else gets compiled into a series of conditional branches, which are evaluated differently for different threads. When this happens, three contexts are created, a fallthrough, a target, and a reconverge, whose execution is serialized. When the target and fallthrough contexts have reached the reconverge point (some post-dominator of the branch) execution switches back to the reconverge context, which has the same threads that were active before diverging.

The source code for this example is attached in case you want to download it and experiment yourself:

``````#include <ocelot/api/interface/ocelot.h>

#include <ocelot/cuda/interface/cuda_runtime.h>

#include <ocelot/trace/interface/TraceGenerator.h>

#include <ocelot/trace/interface/TraceEvent.h>

#include <ocelot/executive/interface/ExecutableKernel.h>

#include <iostream>

{

public:

virtual void initialize(const executive::ExecutableKernel& kernel)

{

std::cout << "Running kernel " << kernel.name << "\n";

}

virtual void event(const trace::TraceEvent & event)

{

std::string string;

boost::to_string(event.active, string);

std::cout << " [" << event.PC << "] - "

<< event.instruction->toString() << " - ["

<< string << "]\n";

if(event.instruction->opcode == ir::PTXInstruction::Bra)

{

boost::to_string(event.taken, string);

std::cout << "  taken: [" << string << "]\n";

boost::to_string(event.fallthrough, string);

std::cout << "  fall-through: [" << string << "]\n";

boost::to_string(event.active, string);

std::cout << "  reconverge: [" << string << "]\n";

}

}

virtual void finish()

{

std::cout << "Finished kernel\n";

}

};

int main(int argc, char** argv)

{

std::stringstream ptx;

ptx << ".version 1.4\n";

ptx << ".entry warpTest()\n";

ptx << "{\n";

ptx << "\t.reg .u32 %r<10>;\n";

ptx << "\t.reg .u32 %tid;\n";

ptx << "\t.reg .pred %pred;\n";

ptx << "\tmov.u32 %r0, 2;\n";

ptx << "\tmov.u32 %r2, 1;\n";

ptx << "\tcvt.u32.u16 %tid, %tid.x;\n";

ptx << "\tsetp.ge.u32 %pred, %tid, %r0;\n";

ptx << "\t@%pred bra Exit;\n";

ptx << "\tadd.u32 %r1, %r0, 1;\n";

ptx << "\tsetp.ge.u32 %pred, %tid, %r2;\n";

ptx << "\t@%pred bra Exit;\n";

ptx << "Exit:\n";

ptx << "\texit;\n";

ptx << "}\n";

ocelot::registerPTXModule(ptx, "module");

cudaConfigureCall(dim3(1,1,1), dim3(4, 1, 1), 0, 0);

ocelot::launch("module", "warpTest");

return 0;

}
``````

double post

double double post