Thread Divergence

satyam_shivam · May 29, 2010, 5:52am

Hello Everyone,

I want to understand how the thread divergence is caused in an if-else construct ?

Thread divergence is caused if the decision condition is based on thread Index ?
Thread divergence is caused if the decision condition is based on block Index ?
or, both of above.

The kernel below has an if-else construct depending on block Index.

The kernel (which is not optimized) has been launched with following kernel parameters:

dim3 block( 1, 256);
dim3 grid( 2048, 8 );

    dimx =2048;dimy=2048;
kernel_A<<<grid,block>>>( d_data, dimx,dimy, niterations );

global void kernel_A( float g_data, int dimx, int dimy, int niterations )
{
int ix = blockIdx.x;
int iy = blockIdx.yblockDim.y + threadIdx.y;
int idx = iy*dimx + ix;

float value = g_data[idx];

if( ix % 2 )
{
	for(int i=0; i<niterations; i++)
	{
		value += sqrtf( logf(value) + 1.f );
	}
}
else
{
	for(int i=0; i<niterations; i++)
	{
		value += sqrtf( cosf(value) + 1.f );
	}
}

g_data[idx] = value;

}

Thanks and Regards

avidday · May 29, 2010, 7:30am

A test based on blockIdx won’t cause warp divergence because it will be constant for every thread in any given running warp of threads.

Gregory_Diamos · May 29, 2010, 4:33pm

Here is a complete example of warp divergence. This uses Ocelot to execute the program rather than CUDA (there may be differences with how it is implemented in NVIDIA GPUs, but at least this is how it is understood to happen in the academic world)

Take the following very simple example:

int a = 2;

if(threadIdx.x > 1)

{

  int b = a + 1;

  if(threadIdx.x > 0)

  {

}

}

Assume that this is compiled with no optimization, and evaluated in a straightforward manner. You might get a binary that resembles this:

.version 1.4

	.entry warpTest()

	{

		.reg .u32 %r<10>;

		.reg .u32 %tid;

		.reg .pred %pred;

		

		Entry:

			mov.u32 %r0, 2;

			mov.u32 %r2, 1;

			cvt.u32.u16 %tid, %tid.x;

			setp.ge.u32 %pred, %tid, %r0;

			@%pred bra Exit;

			add.u32 %r1, %r0, 1;

			setp.ge.u32 %pred, %tid, %r2;

			@%pred bra Exit;

		Exit:

			exit;

	}

Now assume that you ran this example with one warp, with only four threads active.

This produces the following execution trace which indicates how each instruction is executed.

Running kernel warpTest

 [0] - mov.u32 %r0, 2 - [1111]

 [1] - mov.u32 %r1, 1 - [1111]

 [2] - cvt.u32.u16 %r2, %tid.x - [1111]

 [3] - setp.ge.u32 %p3, %r2, %r0 - [1111]

 [4] - @%p3 bra $BB_1_3 - [1111]

  taken: [1100]

  fall-through: [0011]

  reconverge: [1111]

 [5] - add.u32 %r4, %r0, 1 - [0011]

 [6] - setp.ge.u32 %p3, %r2, %r1 - [0011]

 [7] - @%p3 bra $BB_1_3 - [0011]

  taken: [0010]

  fall-through: [0001]

  reconverge: [0011]

 [8] - reconverge - [0001]

 [8] - reconverge - [0010]

 [9] - reconverge - [0011]

 [8] - reconverge - [1100]

 [10] - exit - [1111]

Finished kernel

In this example, the predicate mask [xxxx] indicates which threads execute the instruction. A mask of [1100] indicates that the first two threads execute the instruction, but not the second two. In this example, the first two threads skip the if-then-else completely, while the second two go into the ‘then’ path. From there, only the last thread falls into the ‘then’ path of the second if-then-else. The if-then-else gets compiled into a series of conditional branches, which are evaluated differently for different threads. When this happens, three contexts are created, a fallthrough, a target, and a reconverge, whose execution is serialized. When the target and fallthrough contexts have reached the reconverge point (some post-dominator of the branch) execution switches back to the reconverge context, which has the same threads that were active before diverging.

The source code for this example is attached in case you want to download it and experiment yourself:

#include <ocelot/api/interface/ocelot.h>

#include <ocelot/cuda/interface/cuda_runtime.h>

#include <ocelot/trace/interface/TraceGenerator.h>

#include <ocelot/trace/interface/TraceEvent.h>

#include <ocelot/executive/interface/ExecutableKernel.h>

#include <iostream>

class MaskGenerator : public trace::TraceGenerator

{

	public:

		virtual void initialize(const executive::ExecutableKernel& kernel)

		{

			std::cout << "Running kernel " << kernel.name << "\n";

		}

		virtual void event(const trace::TraceEvent & event)

		{

			std::string string;

			

			boost::to_string(event.active, string);

			

			std::cout << " [" << event.PC << "] - " 

				<< event.instruction->toString() << " - [" 

				<< string << "]\n";

				

			if(event.instruction->opcode == ir::PTXInstruction::Bra)

			{

				boost::to_string(event.taken, string);

				std::cout << "  taken: [" << string << "]\n";

				boost::to_string(event.fallthrough, string);

				std::cout << "  fall-through: [" << string << "]\n";

				boost::to_string(event.active, string);

				std::cout << "  reconverge: [" << string << "]\n";

			}

		}

		

		virtual void finish()

		{

			std::cout << "Finished kernel\n";

		}

};

int main(int argc, char** argv)

{

	std::stringstream ptx;

	

	ptx << ".version 1.4\n";

	ptx << ".entry warpTest()\n";

	ptx << "{\n";

	ptx << "\t.reg .u32 %r<10>;\n";

	ptx << "\t.reg .u32 %tid;\n";

	ptx << "\t.reg .pred %pred;\n";

	ptx << "\tmov.u32 %r0, 2;\n";

	ptx << "\tmov.u32 %r2, 1;\n";

	ptx << "\tcvt.u32.u16 %tid, %tid.x;\n";

	ptx << "\tsetp.ge.u32 %pred, %tid, %r0;\n";

	ptx << "\t@%pred bra Exit;\n";

	ptx << "\tadd.u32 %r1, %r0, 1;\n";

	ptx << "\tsetp.ge.u32 %pred, %tid, %r2;\n";

	ptx << "\t@%pred bra Exit;\n";

	ptx << "Exit:\n";

	ptx << "\texit;\n";

	ptx << "}\n";

	

	MaskGenerator gen;

	

	ocelot::addTraceGenerator(gen);

	ocelot::registerPTXModule(ptx, "module");

	

	cudaConfigureCall(dim3(1,1,1), dim3(4, 1, 1), 0, 0);

	ocelot::launch("module", "warpTest");

	return 0;	

}

Gregory_Diamos · May 29, 2010, 4:35pm

double post

Gregory_Diamos · May 29, 2010, 4:45pm

double double post

satyam_shivam · June 1, 2010, 11:09am

Thanks a lot for your reply !

Topic		Replies	Views
Is there thread divergence when all threads in a warp execute the same branch? General	0	612	September 19, 2021
Thread divergence due to IF CUDA Programming and Performance	3	6912	September 13, 2007
does a switch statement by thread id cause divergence CUDA Programming and Performance	5	3285	January 7, 2011
Avoid branching ... CUDA Programming and Performance	3	3682	May 19, 2010
Thread divergence when block size is equal to warp size CUDA Programming and Performance	2	647	June 5, 2019
Avoiding thread divergence CUDA Programming and Performance	5	4039	December 12, 2014
Question about divergence and branch granularity CUDA Programming and Performance	1	916	April 25, 2012
threads diverging in a loop when does a loop cause divergance CUDA Programming and Performance	13	21099	May 12, 2007
Single Branch Divergence? [beginner question] CUDA Programming and Performance	3	1222	January 6, 2016
Warp branching CUDA Programming and Performance	11	10391	October 26, 2010

Thread Divergence

Related topics