Divergent branches: not sure where they are occurring

Hi Guys,

New to the forum. I’m using the text version of the CUDA profiler, and it is telling me there are 116 divergent branches in one of my routines (really bad). The problem is that I cannot locate exactly where this is occurring, since I’m not using any if-then-else statements or anything of that nature. Can anyone help me pin this down?

__global__ void
kernel_rhs2 (double *qre, double *qim, double *pre, double *pim,
             double *theta, double *r_c)
{
  int b = blockIdx.y * blockDim.y + threadIdx.y;
  int a = blockIdx.x * blockDim.x + threadIdx.x;

  double delta, r1, r2, a2, ctheta, stheta, sigma2, db2dx, ctan, csec;
  double qrey = 0.0;
  double qreyy = 0.0;
  double qimy = 0.0;
  double qimyy = 0.0;
  double ll_re, ll_im;

  a2 = aa * aa;
  ctheta = cos (theta[b]);
  stheta = sin (theta[b]);
  ctan = ctheta / stheta;
  csec = 1.0 / stheta;
  r1 = r_c[a];
  r2 = r1 * r1;
  delta = r2 - 2.0 * mass * r1 + a2;
  sigma2 = (r2 + a2) * (r2 + a2) - a2 * delta * stheta * stheta;

  double bb = (r2 + a2) / sqrt (sigma2);
  double cc = delta / sigma2;
  double a_re = 2.0 * ss * (r1 * delta - mass * (r2 - a2)) / sigma2;
  double a_im = (4.0 * mass * aa * r1 * mm
                 + 2.0 * ss * aa * delta * ctheta) / sigma2;

  db2dx = delta * pow (bb, 2) / (r2 + a2)
    * (4.0 * r1 / (r2 + a2) - 1.0 / sigma2 *
       (4.0 * r1 * (r2 + a2) -
        a2 * pow (stheta, 2) * (2.0 * r1 - 2.0 * mass)));

  double d_re =
    -2.0 / r1 / sigma2 * (r1 * ss * (mass - r1) * (r2 + a2) -
                          (3.0 * a2 + 4.0 * r2) * delta) +
    2.0 * bb * ss / sigma2 * (mass * (a2 - r2) + r1 * delta) -
    0.5 * db2dx;

  double d_im =
    2.0 * aa * mm * (r2 + a2) / sigma2 +
    2.0 * bb * aa / sigma2 * (2.0 * mm * mass * r1 +
                              ss * delta * ctheta);

  double v_re = delta / r2 / sigma2 * (6.0 * mass * r1 * (ss + 1.0) -
                                       r2 * (7.0 * ss + 6.0) -
                                       6.0 * delta +
                                       r2 * pow (ss * ctan + mm * csec, 2));

  double v_im =
    2.0 * aa * mm / r1 / sigma2 * (2.0 * r1 * ss * (mass - r1) -
                                   3.0 * delta);

  qrey = (qre_c[idx (b + 1, a)] - qre_c[idx (b - 1, a)]) / (2.0 * dtheta);
  qimy = (qim_c[idx (b + 1, a)] - qim_c[idx (b - 1, a)]) / (2.0 * dtheta);
  qreyy = (qre_c[idx (b + 1, a)] + qre_c[idx (b - 1, a)] -
           2.0 * qre_c[idx (b, a)]) / (dtheta * dtheta);
  qimyy = (qim_c[idx (b + 1, a)] + qim_c[idx (b - 1, a)] -
           2.0 * qim_c[idx (b, a)]) / (dtheta * dtheta);

  ll_re = qreyy + ctan * qrey;
  ll_im = qimyy + ctan * qimy;

  double qrex = (qre_h[idx (b, a + 1)] - qre_h[idx (b, a)]) / dx;
  double qimx = (qim_h[idx (b, a + 1)] - qim_h[idx (b, a)]) / dx;
  double prex = (pre_h[idx (b, a + 1)] - pre_h[idx (b, a)]) / dx;
  double pimx = (pim_h[idx (b, a + 1)] - pim_h[idx (b, a)]) / dx;

  rhs_qre[idx (b, a)] = -bb * qrex + pre_c[idx (b, a)];
  rhs_qim[idx (b, a)] = -bb * qimx + pim_c[idx (b, a)];

  rhs_pre[idx (b, a)] = tred[b][a] +
    bb * prex
    + cc * ll_re
    + d_re * qrex - d_im * qimx
    - a_re * pre_c[idx (b, a)] + a_im * pim_c[idx (b, a)]
    - v_re * qre_c[idx (b, a)] + v_im * qim_c[idx (b, a)];

  rhs_pim[idx (b, a)] = timd[b][a] +
    bb * pimx
    + cc * ll_im
    + d_re * qimx + d_im * qrex
    - a_re * pim_c[idx (b, a)] - a_im * pre_c[idx (b, a)]
    - v_re * qim_c[idx (b, a)] - v_im * qre_c[idx (b, a)];
}

I would bet money on your use of the sin/cos/pow functions. These do not get mapped to fast hardware intrinsics for double precision. If you can tolerate the loss of precision, you might consider casting down to floats for them and turning on use_fast_math.
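A minimal sketch of that idea, using the variable names from the kernel above and assuming the reduced precision of the angle terms is acceptable (sinf/cosf map to the fast hardware intrinsics when you compile with nvcc’s --use_fast_math):

    // sketch: evaluate the trig in single precision, then widen back to double
    float th = (float) theta[b];
    stheta = (double) sinf (th);   // fast intrinsic path with --use_fast_math
    ctheta = (double) cosf (th);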

But no, don’t worry about it. Divergent branches in tiny functions like sin() are not that bad. It’s easy to think “Oh! I must eliminate divergence for best performance!!!” but really you should only be concerned when you have threads diverged for a long (hundreds of instructions or more) subroutine.

The double-precision transcendental code has “divergences” which are tiny and minor, things like testing whether a range reduction to -PI/2 to PI/2 is necessary, or perhaps a NaN escape clause.

If you’re looking for some speed boosts in your code, you might look at storing 1/sigma2 and using that as a multiplier in the many places you’re currently dividing.
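For example, a sketch using the kernel’s own variables (fine as long as sigma2 stays well away from zero):

    double inv_sigma2 = 1.0 / sigma2;               // one division up front
    double cc   = delta * inv_sigma2;               // was delta / sigma2
    double a_re = 2.0 * ss * (r1 * delta - mass * (r2 - a2)) * inv_sigma2;
    double d_im = 2.0 * aa * mm * (r2 + a2) * inv_sigma2 +
                  2.0 * bb * aa * inv_sigma2 * (2.0 * mm * mass * r1 +
                                                ss * delta * ctheta);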

A few other changes that will improve speed:

  1. Combine

    ctheta = cos (theta[b]);
    stheta = sin (theta[b]);

    into

    sincos (theta[b], &stheta, &ctheta);

  2. Get rid of the pow (., 2) calls and use an explicit multiply (probably 7x faster); see the sketch after this list.

  3. As SPWorley already mentioned, instead of

    double bb = (r2 + a2) / sqrt (sigma2);

    use

    double bb = (r2 + a2) * rsqrt (sigma2);
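For item 2, a sketch applied to the pow (., 2) inside v_re: pow () on doubles goes through a general exponentiation routine, whereas x * x is a single multiply.

    double sc   = ss * ctan + mm * csec;   // hoist the base of the square
    double v_re = delta / r2 / sigma2 * (6.0 * mass * r1 * (ss + 1.0) -
                                         r2 * (7.0 * ss + 6.0) -
                                         6.0 * delta +
                                         r2 * sc * sc);   // was r2 * pow (sc, 2)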

More speedups (these microopts are fun…)

Precalculate idx (b, a). Switch your input data from four arrays to an array of structures, so the values read at the same point are local to each other. A sketch of both is below.
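A sketch of both ideas, assuming idx (b, a) is a plain 2-D index into row-major storage; the struct and its name are made up purely for illustration:

    // compute each index once and reuse it
    int i  = idx (b, a);
    int ip = idx (b, a + 1);
    double qrex = (qre_h[ip] - qre_h[i]) / dx;
    rhs_qre[i]  = -bb * qrex + pre_c[i];

    // hypothetical array-of-structures layout for the *_c data, so the four
    // values read at the same (b, a) sit next to each other in memory
    struct field { double qre, qim, pre, pim; };
    // field_c[i].qre, field_c[i].qim, field_c[i].pre, field_c[i].pim
    // would then replace qre_c[i], qim_c[i], pre_c[i], pim_c[i]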

Look at each line and see if there are algebraic reorderings to save computes. Like:

db2dx = delta * pow (bb, 2) / (r2 + a2)
  * (4.0 * r1 / (r2 + a2) - 1.0 / sigma2 *
     (4.0 * r1 * (r2 + a2) -
      a2 * pow (stheta, 2) * (2.0 * r1 - 2.0 * mass)));

// change to

db2dx = delta * bb * bb / (r2 + a2)
  * (4.0 * r1 / (r2 + a2) - 1.0 / sigma2 *
     (4.0 * r1 * (r2 + a2) - 2.0 * a2 * stheta * stheta * (r1 - mass)));

A very general observation is that the code’s performance seems to be dominated by double-precision divisions. To achieve a speedup, these could be reduced through a combination of common-subexpression extraction and conversion into multiplies, if the numerical properties allow it. That caveat is also the reason the compiler doesn’t apply such transformations on its own, but a programmer with knowledge of the possible operand ranges and the numerical properties of the code may well do so by hand. Note that this applies equally to C/C++ code on the CPU and the GPU.

One example that applies to this code is the transformation: a / b / c ==> a / (b * c)
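That pattern shows up in d_re, v_re, and v_im in the kernel above. A sketch of the rewrite, valid as long as the products cannot overflow or underflow (the *_scale names are just for illustration):

    double d_re_scale = -2.0 / (r1 * sigma2);          // was -2.0 / r1 / sigma2
    double v_re_scale = delta / (r2 * sigma2);         // was delta / r2 / sigma2
    double v_im_scale = 2.0 * aa * mm / (r1 * sigma2); // was 2.0 * aa * mm / r1 / sigma2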