Thread Divergence, branches, examples

I have some examples which give me some strange headaches:

I produce a thread divergence, but I cannot figure out which branch or which statements are computed first?

First example:

I have the following kernel, which I start by 2 threads in 1 block.

with a[0]=0, and a1=0.

__global__ void branchTest_kernel( float* a){

int tx = threadIdx.x;

if(tx==0){                   // or tx==1

         a[1] = a[0] + 1;  (a)

      }else if(tx==1){             // or tx==0

         a[0] = a[1] + 1;;         (b)

      }

    }

[b]Output

a[0] = 1

a1 = 2 [/b]

I assum that because the two threads are in on warp, they execute in lockstep and is that the reason why they get serialized that (b) is first and (a) second, but why this way around? (b) is precedent also if first if -> tx==1 and second else if is tx==0…

Second example:

Exactly the same as the first but, now removed the else if part:

__global__ void branchTest_kernel( float* a){

int tx = threadIdx.x;

if(tx==0){

         a[1] = a[0] + 1;  (a)

      }else{

         a[0] = a[1] + 1;  (b)

      }

}

[b]Output

a[0] = 2

a1 = 1[/b]

What causes this behaviour that suddenly now (a) is first, and (b) second…

Can somebody explain how the precendence rules are for branches? Or where to find such information?

Thanks alot!

I encountered this example during an implementation of a Gauss-Seidel Solver:

Gauss Seidel See Figure 3, (a) diagonal block

i think only that
on exemple 1 tread 1 run before thread 0
on exemple 2 tread 0 run before thread 1

why i dont know but writing one thing that need the compute of other thread before is not good

I test your code on GTX480 with CUDA4.0RC2 and compile code with -arch=sm_20

example 1:

A[0] = 1

A[1] = 2

example 2:

A[0] = 1

A[1] = 1

I modify example 1, disable warp synchronization by a fro-loop.

this can guarantee order of execution. I call it as example 3

example 3

__global__ void branchTest_kernel( float* a)

{

    int tx = threadIdx.x;

    for(int i = 0 ; i < 32 ; i++){

        if ( tx != i ){ continue ;}

        if(tx==0){                   // or tx==1

            a[1] = a[0] + 1;

        }else if(tx==1){             // or tx==0

            a[0] = a[1] + 1;;

        }

    }

}

example 3:

A[0] = 2

A[1] = 1

Only example 3 can generate expected result. You should be carefully on warp synchronization.

If you look at assembly code of example 2, then nvcc generate LDU to load A, i.e. load A from constant cache,

that is why A[0] = A[1] = 1.

Why does it give you headaches? The order in which threads execute is undefined, so you should not rely on it. It might be interesting for educational purposes, or personal entertainment. But it should be irrelevant for practical purposes.

And if you want to figure out anything, look at disassembled binaries. Staring at source code and guessing what the optimizer may have transformed it to seems futile.