I understand the concept of branch divergence in CUDA, but when I test an application to observe divergence in practice, nvprof reports branch and divergent-branch counts that I can't account for.

So I have these two kernels:

```
__global__ void mathKernel1(float *c)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float ia, ib;
    ia = ib = 0.0f;

    if (tid % 2 == 0) {           // branches on the thread ID
        ia = 100.0f;
        ib = 50.0f;
        ia = pow(ia, 3) * ib;
    } else {
        ib = 200.0f;
        ia = 15.0f;
        ib = pow(ib, 3) + ia;
    }

    c[tid] = ia + ib;
}
```

```
__global__ void mathKernel2(float *c)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float ia, ib;
    ia = ib = 0.0f;

    if ((tid / warpSize) % 2 == 0) {   // branches on the warp index
        ia = 100.0f;
        ib = 50.0f;
        ia = pow(ia, 3) + ib;
    } else {
        ib = 200.0f;
        ia = 15.0f;
        ib = pow(ib, 3) + ia;
    }

    c[tid] = ia + ib;
}
```

Both kernels are launched with a single block of 64 threads on a GTX 650.

nvprof tells me that the first kernel has 22 branches and 2 divergent branches, while the second has 12 branches and 0 divergent branches.
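For reference, I'm collecting these counts from the `branch` and `divergent_branch` events, with an invocation along these lines (`./simpleDivergence` stands in for my actual binary):

```
nvprof --events branch,divergent_branch ./simpleDivergence
```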

The divergent-branch counts match my expectations: there are 2 warps, and since the first kernel's condition depends on the thread ID, each warp diverges, giving 2 divergent branches; the second kernel branches on the warp index, so its 2 branches are non-divergent. What I don't understand is where the other 20 branches in kernel 1 and the 12 branches in kernel 2 come from.