I have the following code that I want to optimize with CUDA (somewhat simplified):
while (it < iterationsd && fabs(delta) > 0.0000002) {
    slope = 0.0;
    if (tt > 0.0) {
        lz = tt;
        /* the following values depend on lz */
        z1 = exp(ratxv * lz);
        z1zz = exp(rat * lz);
        y1 = 1.0 - z1;
        z1yy = z1 - z1zz;
        z1xv = z1 * xv;
        for (i = 0; i < size; i++) {
            cc = prod1[i];
            bb = prod2[i];
            aa = prod3[i];
            slope += weightrat[i] * (z1zz * (bb - aa) + z1xv * (cc - bb)) / (aa * z1zz + bb * z1yy + cc * y1);
        }
    }
    if (slope < 0.0)
        delta = fabs(delta) / 2.0;
    else
        delta = fabs(delta);
    tt += delta;
    it++;
}
I changed it to code like this:
while (it < iterationsd && fabs(delta) > 0.0000002) {
    slope = 0.0;
    if (tt > 0.0) {
        lz = tt;
        /* the following values depend on lz */
        z1 = exp(ratxv * lz);
        z1zz = exp(rat * lz);
        y1 = 1.0 - z1;
        z1yy = z1 - z1zz;
        z1xv = z1 * xv;
        /* kernel1 calculates each slope[i] like
           slope[i] += weightrat[i] * (z1zz * (bb - aa) + z1xv * (cc - bb)) / (aa * z1zz + bb * z1yy + cc * y1);
        */
        kernel1<<< ... >>>(...);
        /* kernel2 calculates the sum of slope[0..size-1] */
        kernel2<<< ... >>>(...);
    }
    if (slope < 0.0)
        delta = fabs(delta) / 2.0;
    else
        delta = fabs(delta);
    tt += delta;
    it++;
}
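For reference, kernel1 itself would be roughly the following (just a sketch; the parameter names simply mirror the variables in the loop above, and I write `slope[i] =` rather than `+=` since slope is recomputed from scratch in each outer iteration):

```cuda
__global__ void kernel1(const double* prod1, const double* prod2, const double* prod3,
                        const double* weightrat, double* slope,
                        double z1zz, double z1xv, double z1yy, double y1, int size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one thread per element */
    if (i < size) {
        double cc = prod1[i];
        double bb = prod2[i];
        double aa = prod3[i];
        slope[i] = weightrat[i] * (z1zz * (bb - aa) + z1xv * (cc - bb))
                 / (aa * z1zz + bb * z1yy + cc * y1);
    }
}
```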
Now the problem is that the problem size is not very large (about 10000), so I think the kernels themselves execute quite fast, but a lot of time is wasted on communication between host and device.
I am thinking of putting the whole while loop into a single kernel, but I have the following questions:

How do I sync between different blocks?
For example, I need to sync all threads at the point right after kernel1's work, before I can start to calculate the sum. How do I do that? Using global memory? Is there a good example?
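One thing I came across is cooperative groups: if I understand correctly, the whole loop could become one persistent kernel and grid.sync() would act as a barrier across all blocks. A rough sketch of what I mean (this assumes the device supports cooperative launch, and the kernel has to be started with cudaLaunchCooperativeKernel instead of <<< >>>):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

/* Hypothetical single persistent kernel replacing the host-side loop body. */
__global__ void solver_kernel(double* slope, double* slope_sum, int size)
{
    cg::grid_group grid = cg::this_grid();

    /* ... kernel1's work: each thread writes its slope[i] ... */

    grid.sync();  /* barrier across ALL blocks: every slope[i] is written */

    /* ... kernel2's work: reduce slope[0..size-1] into *slope_sum ... */

    grid.sync();  /* the finished sum is visible to every block */
}

/* Host side: a cooperative kernel cannot be launched with <<< >>>. */
void launch(double* d_slope, double* d_sum, int size, dim3 grid, dim3 block)
{
    void* args[] = { &d_slope, &d_sum, &size };
    cudaLaunchCooperativeKernel((void*)solver_kernel, grid, block, args);
}
```

Is this the right direction, or is a hand-rolled global-memory barrier still needed?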
How do I implement the sum of all slope[i] efficiently? Basically I need to sum slope[i] across all blocks and then broadcast the sum to every block.
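For the sum itself I was imagining a standard shared-memory tree reduction inside each block, followed by one atomicAdd per block into a single global accumulator (a sketch; it assumes blockDim.x is a power of two no larger than 256, that *total is zeroed before each pass, and double-precision atomicAdd, which needs compute capability >= 6.0):

```cuda
/* Hypothetical kernel2: per-block tree reduction, then one atomicAdd per block. */
__global__ void kernel2(const double* slope, double* total, int size)
{
    __shared__ double sdata[256];            /* assumes blockDim.x <= 256 */
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    sdata[tid] = (i < size) ? slope[i] : 0.0; /* pad out-of-range threads with 0 */
    __syncthreads();

    /* tree reduction within the block (blockDim.x must be a power of two) */
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    /* one global atomic per block; *total must start at 0.0 */
    if (tid == 0)
        atomicAdd(total, sdata[0]);
}
```

If this runs inside a single persistent kernel, I suppose every block could then just read *total after a grid-wide barrier instead of doing an explicit broadcast.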
I guess this problem is common, so hopefully you guys can help me. Any hint or guide is appreciated :)