How to combine a normal kernel and a reduction kernel inside a loop

I have the following code to be optimized in CUDA (with some simplification):

while (it < iterationsd && fabs(delta) > 0.0000002) {
    slope = 0.0;
    if (tt > 0.0) {
        lz = -tt;

        /* the following values depend on lz */
        z1 = exp(ratxv * lz);
        z1zz = exp(rat * lz);
        y1 = 1.0 - z1;
        z1yy = z1 - z1zz;
        z1xv = z1 * xv;

        for (i = 0; i < size; i++) {
            cc = prod1[i];
            bb = prod2[i];
            aa = prod3[i];

            slope += weightrat[i] * (z1zz * (bb - aa) + z1xv * (cc - bb)) / (aa * z1zz + bb * z1yy + cc * y1);
        }
    }
    if (slope < 0.0)
        delta = fabs(delta) / -2.0;
    else
        delta = fabs(delta);
    tt += delta;
    it++;
}

I changed it to code like this:

while (it < iterationsd && fabs(delta) > 0.0000002) {
    slope = 0.0;
    if (tt > 0.0) {
        lz = -tt;

        /* the following values depend on lz */
        z1 = exp(ratxv * lz);
        z1zz = exp(rat * lz);
        y1 = 1.0 - z1;
        z1yy = z1 - z1zz;
        z1xv = z1 * xv;

        /* kernel1 calculates slope[i] like
           slope[i] += weightrat[i] * (z1zz * (bb - aa) + z1xv * (cc - bb)) / (aa * z1zz + bb * z1yy + cc * y1);
        */
        kernel1<<< ... >>>

        /* kernel2 calculates the sum of slope[0..size] */
        kernel2<<< ... >>>
    }
    if (slope < 0.0)
        delta = fabs(delta) / -2.0;
    else
        delta = fabs(delta);
    tt += delta;
    it++;
}

The issue now is that the problem size is not very large (about 10000 elements), so I think the kernels themselves execute quite fast, but a lot of time is wasted on communication between host and device.

I am thinking of putting the whole while loop into a kernel, but I have the following questions:

  1. How do I synchronize between different blocks?
    For example, I need to synchronize all threads after kernel1 before I can start calculating the sum. How can I do that? Using global memory? Is there a good example?

  2. How do I implement the sum of all slope[i] efficiently? Basically, I need to sum slope[i] across all blocks and then broadcast the sum to each block.

I guess this problem is common, so hopefully you guys can help me. Any hint or guide is appreciated :)

I didn’t read your problem in detail, but the Thrust library is quite good at combining reductions with other user-defined operations.

See the section titled “Fusion” here:
http://code.google.com/p/thrust/downloads/…To%20Thrust.pdf
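
As a rough illustration of what fusion could look like for the loop in the question, here is a minimal sketch using `thrust::transform_reduce` over a zip iterator. It assumes the four input arrays already live in `thrust::device_vector<double>` containers; the functor `SlopeTerm` and the helper `fused_slope` are hypothetical names invented for this sketch, not part of Thrust.

```cuda
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/transform_reduce.h>
#include <thrust/tuple.h>

// Hypothetical functor: computes one term of the slope sum. The scalar
// members are the values the host derives from lz in each iteration.
struct SlopeTerm {
    double z1zz, z1yy, z1xv, y1;

    template <typename Tuple>
    __host__ __device__ double operator()(const Tuple& t) const {
        const double cc = thrust::get<0>(t);  // prod1[i]
        const double bb = thrust::get<1>(t);  // prod2[i]
        const double aa = thrust::get<2>(t);  // prod3[i]
        const double w  = thrust::get<3>(t);  // weightrat[i]
        return w * (z1zz * (bb - aa) + z1xv * (cc - bb)) /
               (aa * z1zz + bb * z1yy + cc * y1);
    }
};

// Hypothetical helper: evaluates the per-element term and sums it in one
// fused pass, so no intermediate slope[] array is written to global memory.
double fused_slope(const thrust::device_vector<double>& prod1,
                   const thrust::device_vector<double>& prod2,
                   const thrust::device_vector<double>& prod3,
                   const thrust::device_vector<double>& weightrat,
                   double z1zz, double z1yy, double z1xv, double y1) {
    auto first = thrust::make_zip_iterator(thrust::make_tuple(
        prod1.begin(), prod2.begin(), prod3.begin(), weightrat.begin()));
    auto last = first + prod1.size();
    return thrust::transform_reduce(first, last,
                                    SlopeTerm{z1zz, z1yy, z1xv, y1},
                                    0.0, thrust::plus<double>());
}
```

The host loop would call `fused_slope` once per iteration in place of kernel1 and kernel2. Only the scalar result crosses back to the host, although each call still carries launch overhead.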

You cannot easily sync between blocks, but you can move part of the sum into kernel1:

  • Assuming the block size of kernel1 is BS1, each block in kernel1 computes its slopes and reduces them to one partial sum in shared memory.

  • Afterwards you still need a reduction kernel to add up those partial sums, but its input has only about size/BS1 elements (one per block) instead of size.

This way, part of the parallel sum is done in your first kernel.
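
The two steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: BS1 is fixed at 256 (a power of two, which the tree reduction below relies on), all arrays are already on the device, and the names `block_sum`, `slope_partial`, `sum_partials`, and `device_slope` are hypothetical, not taken from the original code.

```cuda
#include <cuda_runtime.h>

constexpr int BS1 = 256;  // assumed block size; must be a power of two

// Tree reduction within one block; every thread of the block must call it.
__device__ double block_sum(double* cache, double value) {
    cache[threadIdx.x] = value;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    return cache[0];
}

// kernel1: compute the slope terms and write one partial sum per block.
__global__ void slope_partial(const double* prod1, const double* prod2,
                              const double* prod3, const double* weightrat,
                              int size, double z1zz, double z1yy,
                              double z1xv, double y1, double* partial) {
    __shared__ double cache[BS1];
    double acc = 0.0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < size;
         i += gridDim.x * blockDim.x) {
        const double cc = prod1[i], bb = prod2[i], aa = prod3[i];
        acc += weightrat[i] * (z1zz * (bb - aa) + z1xv * (cc - bb)) /
               (aa * z1zz + bb * z1yy + cc * y1);
    }
    const double total = block_sum(cache, acc);
    if (threadIdx.x == 0) partial[blockIdx.x] = total;
}

// kernel2: a single block adds up the n partial sums.
__global__ void sum_partials(const double* partial, int n, double* result) {
    __shared__ double cache[BS1];
    double acc = 0.0;
    for (int i = threadIdx.x; i < n; i += blockDim.x) acc += partial[i];
    const double total = block_sum(cache, acc);
    if (threadIdx.x == 0) *result = total;
}

// Host side: d_partial needs room for ceil(size / BS1) doubles,
// d_result for one. Only a single double is copied back per call.
double device_slope(const double* d_prod1, const double* d_prod2,
                    const double* d_prod3, const double* d_weightrat,
                    int size, double z1zz, double z1yy, double z1xv,
                    double y1, double* d_partial, double* d_result) {
    const int blocks = (size + BS1 - 1) / BS1;
    slope_partial<<<blocks, BS1>>>(d_prod1, d_prod2, d_prod3, d_weightrat,
                                   size, z1zz, z1yy, z1xv, y1, d_partial);
    sum_partials<<<1, BS1>>>(d_partial, blocks, d_result);
    double slope = 0.0;
    cudaMemcpy(&slope, d_result, sizeof(double), cudaMemcpyDeviceToHost);
    return slope;
}
```

With size around 10000 and BS1 = 256, the second kernel sees only 40 partial sums, so a single block is enough for it. Error checking on the CUDA calls is omitted to keep the sketch short.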