# How to combine a normal kernel and reduce kernel inside loop

I have the following code to be optimized in CUDA (with some simplification):

while (it < iterationsd && fabs(delta) > 0.0000002) {
slope = 0.0;
if (tt > 0.0) {
lz = -tt;

``````    /* the following values depends on lz */
z1 = exp(ratxv * lz);
z1zz = exp(rat * lz);
y1 = 1.0 - z1;
z1yy = z1 - z1zz;
z1xv = z1 * xv;

for (i = 0; i < size; i++) {
cc = prod1[i];
bb = prod2[i];
aa = prod3[i];

slope += weightrat[i] * (z1zz * (bb - aa) +    z1xv * (cc - bb)) /    (aa * z1zz + bb * z1yy + cc * y1);
}
}
if (slope < 0.0)
delta = fabs(delta) / -2.0;
else
delta = fabs(delta);
tt += delta;
it++;
``````

}

I change it to code like that:

while (it < iterationsd && fabs(delta) > 0.0000002) {
slope = 0.0;
if (tt > 0.0) {
lz = -tt;

``````    /* the following values depends on lz */
z1 = exp(ratxv * lz);
z1zz = exp(rat * lz);
y1 = 1.0 - z1;
z1yy = z1 - z1zz;
z1xv = z1 * xv;

/* kernel1 calcuate slope[i] like
slope[i] += weightrat[i] * (z1zz * (bb - aa) +    z1xv * (cc - bb)) /    (aa * z1zz + bb * z1yy + cc * y1);
*/
kernel1<<< ... >>>

/* kernel calcuate the sum of slope[0..size] */
kernel2<<< ... >>>
}
if (slope < 0.0)
delta = fabs(delta) / -2.0;
else
delta = fabs(delta);
tt += delta;
it++;
``````

}

Now the problem is that the problem size is not so large (about 10000), so I think the kernels itself executes quite fast, but a lot of time is wasted in communication between host and device.

I am thinking of put this whole loop (while) into a kernel. But I have the following questions:

1. How to sync between different block?
For example, I need to sync all thread at the place after kernel1, then I can start to calculate the sum. But how to do that? Using global memory ? Any good example ?

2. How to implement the sum of all slope[i] efficiently? Basically I need to sum slope[i] in all blocks and then broadcast the sum to each block.

I guess this problem is common, so hopefully you guys can help me. Any hint or guide is appreciated :)

I didn’t read your problem in detail, but the Thrust library is quite good at combining reductions with other user-defined operations.

See the section titled “Fusion” here: