A 3-level loop nest

Dear Mat,

I have a code with 3 levels of loops. I tried to use OpenACC to accelerate the outer loop, as shown below.

#pragma acc kernels copy(l[:N*N],u[:N*N]) copyin(a[:N*N]) local(sum)
for(i=0; i<n-1; i++)
        {
                
                for(j=0; j<n; j++)
                {
                        if(j>i)
                        {
                                for(k=0,sum=0; k<n; k++)
                                {
                                        if(k != i)
                                        {
                                           sum += l[j][k]*u[k][i];
                                        }
                                }
                                l[j][i] = (float)((a[j][i]-sum)/u[i][i]);
                        }
                }

                
                for(j=0; j<n; j++)
                {
                        if(j>i)
                        {
                                for(k=0,sum=0; k<n; k++)
                                {
                                        if(k != i+1)
                                        {
                                          sum += l[i+1][k]*u[k][j];
                                        }
                                }
                                u[i+1][j] = (float)((a[i+1][j]-sum));
                        }
                }

        }
But I found the result is not the same as with the CPU code. I also tried to accelerate the inner loop, but that failed.
Can you give me some suggestions?

Hi Sisiy,

But I found the result is not the same as with the CPU code.

Most likely it’s because you put “sum” in a local clause. This makes it a single global variable shared by all threads, so the threads overwrite each other’s partial sums. Please remove this clause and try again. If that still doesn’t fix it, please post or send me a reproducing example.
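
As a minimal sketch of the difference (a made-up row-sum helper, not your code): declaring the accumulator inside the loop body gives each iteration its own copy, instead of one device copy that every thread updates. The name row_sums and the flat row-major layout are assumptions for illustration only.

```c
/* Hypothetical helper: out[j] = sum over k of m[j*n + k]. */
void row_sums(int n, const float *m, float *out)
{
    #pragma acc parallel loop copyin(m[0:n*n]) copyout(out[0:n])
    for (int j = 0; j < n; j++) {
        float sum = 0.0f;   /* declared in the loop body: one copy per iteration */
        for (int k = 0; k < n; k++) {
            sum += m[j*n + k];
        }
        out[j] = sum;
    }
}
```

Without OpenACC enabled, the pragma is ignored and the function runs serially with the same result, which makes it easy to compare against the CPU version.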

I also tried to accelerate the inner loop, but that failed.

Do you mean the “j” or “k” loop? The “j” loop should accelerate, assuming the compiler hasn’t found some dependency (check the compiler feedback messages from -Minfo=accel). The “k” loops won’t accelerate due to the “if” statement. However, you could use the “parallel” model instead, collapse the i and j loops into a “gang loop”, and then parallelize the “k” loops with a “vector loop”. Something along the lines of:

#pragma acc data copy(l[:N*N],u[:N*N]) copyin(a[:N*N])
{
#pragma acc parallel
{
#pragma acc loop collapse(2) gang private(sum)
for(i=0; i<n-1; i++)
        {
                for(j=0; j<n; j++)
                {
                        if(j>i)
                        {
                                sum = 0;
#pragma acc loop vector reduction(+:sum)
                                for(k=0; k<n; k++)
                                {
                                        if(k != i)
                                        {
                                           sum += l[j][k]*u[k][i];
                                        }
                                }
                                l[j][i] = (float)((a[j][i]-sum)/u[i][i]);
                        }
                }
        }
} // end first parallel region
#pragma acc parallel
{
#pragma acc loop collapse(2) gang private(sum)
for(i=0; i<n-1; i++)
        {
                for(j=0; j<n; j++)
                {
                        if(j>i)
                        {
                                sum = 0;
#pragma acc loop vector reduction(+:sum)
                                for(k=0; k<n; k++)
                                {
                                        if(k != i+1)
                                        {
                                          sum += l[i+1][k]*u[k][j];
                                        }
                                }
                                u[i+1][j] = (float)((a[i+1][j]-sum));
                        }
                }
        }
}  // end second parallel region
}  // end data region
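
One caveat, assuming this is the usual LU factorization: each pass of the i loop reads entries of l and u that were written by earlier passes, so running the i iterations in parallel, or splitting them into two separate regions, may not reproduce the CPU result. A more conservative sketch keeps i sequential on the host and parallelizes only j (gang) and k (vector, with a sum reduction). The function name lu_acc and the flat row-major arrays are assumptions for illustration; l and u are expected to be initialized the same way as in the CPU code.

```c
/* Sketch: i stays sequential on the host; only the j and k loops run in parallel.
   Assumes row-major n*n arrays, with l and u initialized as in the CPU code. */
void lu_acc(int n, const float *a, float *l, float *u)
{
    #pragma acc data copy(l[0:n*n], u[0:n*n]) copyin(a[0:n*n])
    {
        for (int i = 0; i < n - 1; i++) {
            /* Column i of l: different j touch different rows of l. */
            #pragma acc parallel loop gang
            for (int j = i + 1; j < n; j++) {
                float sum = 0.0f;
                #pragma acc loop vector reduction(+:sum)
                for (int k = 0; k < n; k++) {
                    if (k != i) sum += l[j*n + k] * u[k*n + i];
                }
                l[j*n + i] = (a[j*n + i] - sum) / u[i*n + i];
            }
            /* Row i+1 of u: uses the column of l computed just above. */
            #pragma acc parallel loop gang
            for (int j = i + 1; j < n; j++) {
                float sum = 0.0f;
                #pragma acc loop vector reduction(+:sum)
                for (int k = 0; k < n; k++) {
                    if (k != i + 1) sum += l[(i + 1)*n + k] * u[k*n + j];
                }
                u[(i + 1)*n + j] = a[(i + 1)*n + j] - sum;
            }
        }
    }
}
```

The data region keeps l, u and a on the device across all i steps, so the extra kernel launches per step do not add host/device copies.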

Hope this helps,
Mat