Programming Problem: force the inner loop to run sequentially

cwu87488 · September 5, 2016, 2:37am

The code is as follows:

    double ftkv[4], f_h[Q];
    int i, j, k, n;
    double F;
    double ftk, rhot, ut0, ut1, uu, eu;

     #pragma acc data present(f_star, f_xbud, f_xblr, L_dh, e, W, vn, dsx, dsy)
     #pragma acc kernels
     #pragma acc loop independent
    for (i = 1; i <= NX; ++i)
    {
         #pragma acc loop independent
        for (j = 1; j <= NY; ++j)
        {
            rhot = 0;
            ut0 = 0;
            ut1 = 0;

             #pragma acc loop private(ftkv, F, n, ftk, k)
            for (k = 0; k < Q; ++k)
            {
                ftkv[0]= f_xblr[k][i    ][j - 1];
                ftkv[2]= f_xblr[k][i - 1][j - 1];
                ftkv[1]= f_xbud[k][i - 1][j    ];
                ftkv[3]= f_xbud[k][i - 1][j - 1];

                //compute f_(n+1/2)
                f_h[k] = 0.25 * (ftkv[0] + ftkv[1] + ftkv[2] + ftkv[3]);

                F = 0;

                n = 0;
                F += (e[k][0] * vn[n][0] + e[k][1] * vn[n][1]) * ftkv[n] * dsy[j];
                n = 1;
                F += (e[k][0] * vn[n][0] + e[k][1] * vn[n][1]) * ftkv[n] * dsx[i];
                n = 2;
                F += (e[k][0] * vn[n][0] + e[k][1] * vn[n][1]) * ftkv[n] * dsy[j];
                n = 3;
                F += (e[k][0] * vn[n][0] + e[k][1] * vn[n][1]) * ftkv[n] * dsx[i];

                L_dh[k][i][j] = -st / (dsx[i] * dsy[j]) * F;

                ftk = f_h[k];
                rhot += ftk;
                ut0 += e[k][0] * ftk;
                ut1 += e[k][1] * ftk;
            }

            uu = ut0 * ut0 + ut1 * ut1;

             #pragma acc loop private(eu, k)
            for (k = 0; k < Q; ++k)
            {
                eu = e[k][0] * ut0 + e[k][1] * ut1;
                L_dh[k][i][j] += -st / tau * f_h[k];
            }
        }

    }

Here the outer loops over i and j visit every mesh point, and “f_h[Q]” and “ftkv[4]” are temporary arrays local to each point (i,j). But the compiler reports:

     742, Generating present(f_xbud[:][:][:],f_xblr[:][:][:],L_dt[:][:][:],e[:][:],vn[:][:],dsx[:],dsy[:])
         Loop is parallelizable
         Generating copyout(f_h[:])
    745, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        742, #pragma acc loop gang, vector(2) /* blockIdx.y threadIdx.y */
        745, #pragma acc loop gang, vector(64) /* blockIdx.x threadIdx.x */
    748, Loop is parallelizable
    779, Loop is parallelizable
    791, Complex loop carried dependence of ->,L_dt prevents parallelization
         Inner sequential loop scheduled on accelerator

Because of the “Generating copyout(f_h[:])” message, I think the compiler treats “f_h[Q]” as a global array, and the computed results are wrong. How can I define “f_h[Q]” as a local array for each (i,j)?

Also, Q is small (less than 10), so I want to tell the compiler that the k loop does not need to be parallelised and should just run sequentially. I have tried the “loop seq” clause, but it seems to have no effect inside “kernels”. What can I do to force a loop to run sequentially?

Thank you!

MatColgrove · September 5, 2016, 3:05pm

Hi cwu87488,

Give the following schedule a try. To make “f_h” local, put the variable in a private clause. To force the inner loops to be sequential, in addition to adding “seq”, let’s set the outer loops’ schedule to “gang, vector”. Note that scalars are private by default, so generally there’s no need to put them in a private clause.

-Mat

      #pragma acc parallel loop collapse(2) gang vector private(f_h,ftkv)
     for (i = 1; i <= NX; ++i) 
     { 
         for (j = 1; j <= NY; ++j) 
         { 
             rhot = 0; 
             ut0 = 0; 
             ut1 = 0; 

              #pragma acc loop seq
             for (k = 0; k < Q; ++k) 
...
             #pragma acc loop seq
             for (k = 0; k < Q; ++k)
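
For readers who want something compilable, the schedule above can be sketched as a standalone function. This is only an illustration under assumptions: the function name `moments`, the mesh sizes, the simplified arithmetic, and the `copyin`/`copy` data clauses are all hypothetical and not from the thread (the original code uses `present` because its data already lives on the device). Only the `collapse(2) gang vector private(...)` outer schedule and the `loop seq` inner loops mirror the suggestion.

```c
#define NX 8   /* hypothetical mesh size, not from the thread */
#define NY 6   /* hypothetical mesh size, not from the thread */
#define Q  9   /* small per-point loop count, as in the question (Q < 10) */

/* Hypothetical stand-in for the real kernel: for every mesh point (i,j),
 * fill a per-point scratch array f_h, then sum it into rho[i][j]. */
void moments(double f[Q][NX + 1][NY + 1], double rho[NX + 1][NY + 1])
{
    double f_h[Q];   /* scratch array; each (i,j) iteration needs its own copy */
    int i, j, k;

    /* The data clauses are assumptions made so the example stands alone.
     * private(f_h) gives every collapsed (i,j) iteration its own f_h. */
    #pragma acc parallel loop collapse(2) gang vector private(f_h) \
        copyin(f[0:Q][0:NX+1][0:NY+1]) copy(rho[0:NX+1][0:NY+1])
    for (i = 1; i <= NX; ++i) {
        for (j = 1; j <= NY; ++j) {
            double rhot = 0.0;   /* scalar declared in the loop body, private */

            /* Short loop: run sequentially inside each thread. */
            #pragma acc loop seq
            for (k = 0; k < Q; ++k)
                f_h[k] = 0.5 * (f[k][i][j] + f[k][i - 1][j - 1]);

            #pragma acc loop seq
            for (k = 0; k < Q; ++k)
                rhot += f_h[k];

            rho[i][j] = rhot;
        }
    }
}
```

A compiler without OpenACC support ignores the pragmas and runs the loops serially on the host, which gives a convenient reference result to compare against the accelerated build.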

cwu87488 · September 6, 2016, 2:08am

Thank you Mat, it works!

I noticed the following compiler messages:

    651, Generating present(f_b_p[:][:][:],f_xbud[:][:][:],f_xblr[:][:][:],L_dh[:][:][:],e[:][:],W[:],vn[:][:],dsx[:],dsy[:])
         Accelerator kernel generated
         Generating Tesla code
        651, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
        653,   /* blockIdx.x threadIdx.x collapsed */
    661, Complex loop carried dependence of ftkv prevents parallelization
         Loop carried reuse of ftkv prevents parallelization
    694, Complex loop carried dependence of ->,L_dh prevents parallelization

Though I have added “acc loop seq”, the compiler still emits these messages. Can they be ignored?

MatColgrove · September 6, 2016, 9:16pm

The compiler’s loop dependency analysis is done before the scheduler, which is why you’ll see “loop not parallelizable” messages even if you have “seq” on the loop. We decided not to suppress these in case you do want to try to parallelize the inner loops, even though the compiler isn’t actually parallelizing them.

Mat
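
As a contrast to the “seq” schedule, here is a rough sketch of what parallelizing the inner loop could look like. It is not from the thread: the function name `density`, the sizes, and the data clauses are hypothetical. The accumulation `rhot += ...` is a loop-carried dependence of the kind the messages point at, so the sketch declares it with a `reduction` clause. With Q below 10, most vector lanes would sit idle, which is one reason the sequential inner loop is likely the better choice for this code.

```c
#define NX 8   /* hypothetical mesh size, not from the thread */
#define NY 6   /* hypothetical mesh size, not from the thread */
#define Q  9   /* small inner loop count, as in the question (Q < 10) */

/* Hypothetical example: sum f over k at every mesh point (i,j). */
void density(double f[Q][NX + 1][NY + 1], double rho[NX + 1][NY + 1])
{
    int i, j, k;

    /* Mesh points are distributed across gangs; the data clauses are
     * assumptions made so the example stands alone. */
    #pragma acc parallel loop gang collapse(2) \
        copyin(f[0:Q][0:NX+1][0:NY+1]) copy(rho[0:NX+1][0:NY+1])
    for (i = 1; i <= NX; ++i) {
        for (j = 1; j <= NY; ++j) {
            double rhot = 0.0;

            /* The sum over k is carried from one iteration to the next,
             * so it is expressed as a reduction across vector lanes. */
            #pragma acc loop vector reduction(+:rhot)
            for (k = 0; k < Q; ++k)
                rhot += f[k][i][j];

            rho[i][j] = rhot;
        }
    }
}
```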

cwu87488 · September 7, 2016, 1:47pm

Thanks Mat, it’s helpful!