Branch Divergence

Dante003 · October 7, 2015, 2:40am

I understand the concept of branch divergence in CUDA applications, but I’m testing an application to see in practice the divergence, and nvprof is giving a number of branches and divergent branches that I’m failing to understand where those numbers came from.

So I have these two kernel blocks:

__global__ void mathKernel1(float *c)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float ia, ib;
    ia = ib = 0.0f;

    if (tid % 2 == 0){
        ia = 100.0f;
        ib = 50.0;
        ia = pow(ia,3) * ib;
	}
    else{
        ib = 200.0f;
        ia = 15.0f;
        ib = pow(ib,3) + ia;

	}

    c[tid] = ia + ib;
}

__global__ void mathKernel2(float *c)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float ia, ib;
    ia = ib = 0.0f;

    if ((tid / warpSize) % 2 == 0){
        ia = 100.0f;
        ib = 50.0;
        ia = pow(ia,3) + ib;
	}
    else{
        ib = 200.0f;
        ia = 15.0f;
        ib = pow(ib,3) + ia;
	}

    c[tid] = ia + ib;
}

Both kernels are being executed by 1 block with 64 threads on a GTX650.
Nvprof is telling me that the first kernel have 22 branches and 2 divergent branch. The second one have 12 branches and 0 divergent branch.
I was expecting 2 divergent branches, as there is 2 warps and Im dividing them by the thread identification, so there’s 2 non divergent branch. What I don’t understand is where those other 20 branches from kernel1 and 12 branches from kernel 2 came from.

Robert_Crovella · October 7, 2015, 2:51am

The behavior that nvprof is reporting on here, is the behavior of the SASS, not your C source code. For a kernel as simple as this, you can probably connect the two, but in general it may be difficult to do.

Also, nvprof really only knows about “divergence” at the warp level. If two separate warps go down separate code paths, that is not measurably “divergent”.

Your second case has no divergence because this operation:

if ((tid / warpSize) % 2 == 0){

causes all threads in a warp to go down only one of the two possible paths. Either all threads in the warp go down the if path, or all threads in the warp go down the else path. Therefore there is no measured “divergence”.

Dante003 · October 7, 2015, 2:59am

txbob:

The behavior that nvprof is reporting on here, is the behavior of the SASS, not your C source code. For a kernel as simple as this, you can probably connect the two, but in general it may be difficult to do.

Also, nvprof really only knows about “divergence” at the warp level. If two separate warps go down separate code paths, that is not measurably “divergence”.

Your second case has no divergence because this operation:
if ((tid / warpSize) % 2 == 0){
causes all threads in a warp to go down only one of the two possible paths. Either all threads in the warp go down the if path, or all threads in the warp go down the else path. Therefore there is no measured “divergence”.

So what I consider as a branch in a C code is not necessarily represented in SASS code. In C code what I see are two branches for each warp, the branch 1 that gets the lines 3~11 and line 18 too, and the branch 2 that is the else statement, lines 13~15.
Then in SASS code that part that just seems sequential to me is actually represented as many branches?

Is there a way to see the branch numbers of the C code?

njuffa · October 7, 2015, 4:09am

There is no general rule on how branches at the HLL level will be represented at the machine language level. This applies to CUDA just as it applies to any commonly use CPU.

The CUDA tool chain often applies well-known compiler optimizations such as if-conversion that turns a branch body into predicated code, or it may employ select-type instructions which are the machine equivalent of C/C++ ternary operator to avoid branches. Conversely, a branch at HLL level with a compound branch condition may really result in multiple branch instructions based on C/C++ short-circuit evaluation of boolean ANDs and ORs. So you really would want inspect the machine code, called SASS in the GPU context. You can use cuobjdump --dump-sass to extract the machine code from an executable.

Branches on the GPU can only be divergent if the threads within the same warp execute physically different code paths (i.e. sequences of instruction addresses). All threads in a warp share a single program counter, and as long as the threads in a warp execute the same instruction sequence including branches, there is no divergence.

Topic		Replies	Views
Branch divergence and executing serial could be misinterpretted. CUDA Programming and Performance	8	3965	December 21, 2016
Shift direction and divergence CUDA Programming and Performance	7	388	November 13, 2020
Branch / Divergent branches CUDA Programming and Performance	4	15544	March 30, 2010
SASS divergence prevention of non-divergent code CUDA Programming and Performance	5	1670	October 1, 2022
Branch divergence, Boundary element exchange Optimization and best practices CUDA Programming and Performance	9	18571	December 13, 2007
cant call any kernel function CUDA Programming and Performance	8	4843	June 6, 2011
divergent branches Not sure where they are occurring CUDA Programming and Performance	5	1274	February 10, 2011
reduction optimization #1 Not agree with performances explanation CUDA Programming and Performance	8	6683	August 1, 2008
Single Branch Divergence? [beginner question] CUDA Programming and Performance	3	1161	January 6, 2016
Branch or not CUDA Programming and Performance	7	2882	February 28, 2018

Branch Divergence

Related topics