I’m learning CUDA and getting familiar with the Nsight Performance Analysis tools in Visual Studio. I implemented a very naive Sobel edge finder. The goal was to make it work, and then use Performance Analysis to improve performance.

I ran the Performance Analysis tool with all of the Source experiments selected (Instruction Count, Divergent Branch, and Memory Transfer). The Divergent Branch experiment showed several branches with divergence values of 0, 0.6, 18, etc., all pointing to:

```
__MATH_FUNCTIONS_DBL_PTX3_DECL__ double pow(double a, double b)
{
    return __nv_pow(a, b);
}
```

in math_functions_dbl_ptx3.hpp.

In my kernel, I was using the Pythagorean theorem like so:

```
...
int newPixelx = 0;
int newPixely = 0;
for (unsigned char y = 0; y < 3; y++)
{
    for (unsigned char x = 0; x < 3; x++)
    {
        newPixelx += (Gx[x][y] * subImage[x][y]);
        newPixely += (Gy[x][y] * subImage[x][y]);
    }
}
double newPixel = sqrt( pow( (double)newPixelx, 2.0 ) + pow( (double)newPixely, 2.0 ) );
```

I removed the ‘.0’ from the ‘2’s in the last line of that snippet:

```
double newPixel = sqrt( pow( (double)newPixelx, 2 ) + pow( (double)newPixely, 2 ) );
```

And now there’s 100% branch efficiency, with no divergence reported in math_functions_dbl_ptx3.hpp; those pow calls no longer show up in the Divergent Branch results. Kernel execution time on a 1920x1200 image also dropped from ~30 ms to ~1 ms, with the same grid and block dimensions.

What could cause such major divergence when calling pow with ‘2.0’ versus ‘2’?

System Specs:

- Windows 7 Pro 64-bit
- GTX 750 Ti
- Nsight 4.6.0.15071
- Driver 347.62