Below is the code for a Cholesky decomposition.
As you can see, called as a single function it is not ideal. In that form it does have its use, though, if one wanted to run hundreds of Cholesky decompositions in parallel:
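For instance, a batched use could look roughly like this (a minimal sketch; the names As, Ls, buffers, buffer_len and num_matrices are my own placeholders and not taken from the code below):

#pragma acc parallel loop gang
for (size_t b = 0; b < num_matrices; ++b)
{
    // each gang factorizes its own matrix with the worker-level routine shown below
    gpu_cholesky_decomposition_w(As[b], Ls[b], buffers + b * buffer_len, 0);
}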
The question now is how to turn this function into a gang-loop version.
Usually, I can parallelize only the very first loop by turning it into a
#pragma acc parallel loop
The problem seems to be that there is a sequential loop, and the parallelizable loops sit inside this sequential loop. And there, nvc++ has difficulties when I write the loops inside this seq loop as
#pragma acc parallel loop.
The second problem is then how to call gang versions of the matrix multiplications, i.e. gang routines, from inside this sequential loop.
Unfortunately, I have only gotten segfaults, or compiler notices saying that nested parallelism is not possible.
I have also seen in the compiler messages that nvc++ explicitly tries to parallelize the loop marked as seq, even though I explicitly forbade that with the seq clause.
It is clear that I am then no longer in gang-single mode once it parallelizes a sequential loop it should not touch, and thus my attempts always failed.
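To make the second problem concrete, what I would like to write has roughly this shape (a stripped-down, hypothetical sketch; gpu_matrix_multiply_dot_g stands for an assumed gang-level variant of the multiplication routine and is not part of the code below):

#pragma acc routine gang
template <typename T>
void gpu_matrix_multiply_dot_g(const datastruct<T>& A, const datastruct<T>& B, datastruct<T>& C);

#pragma acc parallel
{
    #pragma acc loop seq
    for (size_t c = 0; c < n; ++c)
    {
        // R, RT, S set up as in the routine below
        gpu_matrix_multiply_dot_g(R, RT, S);   // gang-level multiplication called from the seq loop
    }
}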
The OpenACC best practices guide has a section with an OpenACC data statement where several parallel loops are executed one after another (which is close to my use case), but it seems nvc++ would not accept this: https://www.openacc.org/sites/default/files/inline-files/OpenACC_Programming_Guide_0_0.pdf
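What I mean is the pattern where a data region keeps everything resident on the device while the host drives a sequence of kernels, roughly like this (my own paraphrase of that pattern with placeholder arrays adata and ldata, not a quote from the guide):

#pragma acc data copy(adata[0:n*n]) copyout(ldata[0:n*n])
{
    for (size_t c = 0; c < n; ++c)          // sequential loop runs on the host
    {
        #pragma acc parallel loop           // first kernel launch per iteration
        for (size_t i = c + 1; i < n; ++i)
        {
            /* linear algebra on the device-resident data */
        }

        #pragma acc parallel loop           // second kernel launch, data stays on the device
        for (size_t i = c + 1; i < n; ++i)
        {
            /* more linear algebra */
        }
    }
}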
I also tried to write an OpenMP version. Here, I am blocked by this statement from the OpenMP specification:
If a teams construct is nested within a target construct, that target construct must contain no statements, declarations or directives outside of the teams construct.
Since target constructs in OpenMP cannot be nested, this also forbids things like this:
#pragma omp target
{
    while (convergence of the dge solver is not reached)
    {
        #pragma omp teams distribute
        parallelizable for loops with tons of linear algebra
        {}

        #pragma omp teams distribute
        parallelizable for loops with even more tons of linear algebra
        {}
    }
}
This is, unfortunately, the most common use of a loop in physics at all! You usually never have a parallelizable loop at the top level; the parallelizable work usually sits inside a sequential loop.
Apparently, this means that I am supposed to run the sequential loop of the Cholesky decomposition on the host, and then just offload the contents of the sequential loop, writing separate acc parallel loops or pragma omp target teams distribute regions for them…
But this is rather uneconomical in terms of code style, since I then always have parts of the code on the host…
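For completeness, the OpenMP flavor of that workaround would look roughly like this (a sketch under my assumptions, with placeholder arrays adata and ldata and the actual linear algebra left out):

#pragma omp target data map(tofrom: adata[0:n*n], ldata[0:n*n])
{
    for (size_t c = 0; c < n; ++c)                      // sequential loop stays on the host
    {
        #pragma omp target teams distribute parallel for
        for (size_t i = c + 1; i < n; ++i)
        {
            /* first block of linear algebra, data already resident on the device */
        }

        #pragma omp target teams distribute parallel for
        for (size_t i = c + 1; i < n; ++i)
        {
            /* second block of linear algebra */
        }
    }
}

Each iteration then launches its own target regions from the host, which is exactly the style I would like to avoid.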
> #pragma acc routine worker
> template <typename T>
> inline void gpu_cholesky_decomposition_w( const datastruct<T>& A, datastruct<T>& L, T*__restrict buffer, size_t step_size)
> {
> const size_t n = A.pextents[0];
> size_t z = 0; // Zero-based indexing, starts at the first column
>
> if(step_size==0)
> step_size=(size_t)pow(n,0.8385);
>
> const size_t tempsize = (n - step_size) * (n - step_size);
>
> size_t pext3[2];
> size_t pstrides3[2];
>
> const size_t nn=n*n;
> // Allocate memory for S on the device
> T* __restrict sdata;
> T* __restrict adata;
>
> if (buffer==(T*) nullptr)
> {
> sdata=new T[tempsize];
> adata=new T[nn];
> }
> else
> {
> sdata=buffer;
> adata=buffer+tempsize;
> }
>
> #pragma acc loop vector independent
> for (size_t i=0; i<nn; i++)
> {
> adata[i]=A.pdata[i];
> L.pdata[i]=0;
> }
>
> datastruct<T> tempA(adata, 0,A.prowmajor,n, n,pext3, pstrides3,true,true);
> const size_t strtA0=tempA.pstrides[0];
> const size_t strtA1=tempA.pstrides[1];
> const size_t strl0=L.pstrides[0];
> const size_t strl1=L.pstrides[1];
> #pragma acc loop seq
> for (size_t c = 0; c < n; ++c)
> {
> if (c == z + step_size)
> {
> size_t u=n-c;
> size_t v=c-z;
> size_t rtext[2];
> size_t rtstrides[2];
>
> size_t pSstrides[2];
> size_t pSext[2];
> size_t pRext[2];
> size_t pRstr[2];
> datastruct<T> R = L.subspanmatrix(c, z, u, v,pRext,pRstr);
> datastruct<T> RT=R.transpose(rtext,rtstrides);
> datastruct<T> S(sdata, 0, A.prowmajor,u, u, pSext, pSstrides,true,true);
> const size_t strs0=S.pstrides[0];
> const size_t strs1=S.pstrides[1];
> // gpu_matrix_multiply_dot_w(R,RT,S);
>
> const size_t rows=R.pextents[0];
> const size_t cols=RT.pextents[1];
> const size_t inner_dim=R.pextents[1];
>
> const size_t strA0=R.pstrides[0];
> const size_t strA1=R.pstrides[1];
>
> const size_t strB0=RT.pstrides[0];
> const size_t strB1=RT.pstrides[1];
>
> const size_t strC0=S.pstrides[0];
> const size_t strC1=S.pstrides[1];
>
> #pragma acc loop independent collapse(2)
> for (size_t i = 0; i < rows; ++i)
> {
> for (size_t j = 0; j < cols; ++j)
> {
> T sum = 0;
> #pragma acc loop vector independent reduction(+: sum)
> for (size_t k = 0; k < inner_dim; ++k)
> {
> sum += R(i,k,strA0,strA1) *RT(k,j,strB0,strB1);
> }
> S(i,j,strC0,strC1)= sum;
> }
> }
>
> const size_t h=c;
> #pragma acc loop worker independent collapse(2)
> for (size_t i = h; i < n; ++i)
> {
> for (size_t j = h; j < n; ++j)
> {
> tempA(i,j,strtA0,strtA1) -=S(i-h,j-h,strs0,strs1);
> }
> }
>
> z = c;
> }
>
> T temp = 0;
> #pragma acc loop vector independent reduction(+:temp)
> for (size_t k = z; k < c; ++k)
> {
> T tmp3=L(c,k,strl0,strl1);
> temp += tmp3*tmp3;
> }
>
> temp=tempA(c,c,strtA0,strtA1)-temp;
> T temp4=sqrt(temp);
> L(c,c,strl0,strl1) = temp4;
>
>
> #pragma acc loop worker independent
> for (size_t i = c + 1; i < n; ++i)
> {
> T temp2 =0;
> #pragma acc loop vector independent reduction(+:temp2)
> for (size_t k = z; k < c; ++k)
> {
> temp2 += L(i,k,strl0,strl1)*L(c,k,strl0,strl1);
> }
> temp2= tempA(i,c,strtA0,strtA1)-temp2;
> L(i,c,strl0,strl1) = temp2 / temp4;
> }
> }
>
> if(buffer==nullptr)
> {
> // acc_free(sdata);
> delete[] sdata;
> delete[] adata;
> }
>
>
> }
Unfortunately, if, e.g. in the internal matrix multiplications, I replace the worker loops with acc parallel loops, nvc++ will refuse, probably because they are within a sequential loop… but that should just mean they are still in gang-single mode.
It appears that in this situation nvc++ says something like: a loop is a loop, even if it is sequential, and then the single mode stops… This is, as I hope you can understand by now, quite unfortunate, not only for this case, but for many other applications that have parallel loops nested inside sequential loops (which is the most common case in physics)…
I hope you can see that this is not a problem of nested parallelism, but a problem of nvc++ not recognizing that, with the seq clause in:
#pragma acc loop seq
for (size_t c = 0; c < n; ++c)
{
}
the code within the braces {} should still be in gang-single mode.
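In other words, I would expect something of this shape to be accepted, with the inner loop gang-partitioned while the outer seq loop itself stays serial (again only a schematic sketch with placeholder arrays adata and ldata):

#pragma acc parallel
{
    #pragma acc loop seq
    for (size_t c = 0; c < n; ++c)
    {
        // not gang-partitioned here, so a gang loop should still be allowed
        #pragma acc loop gang vector independent
        for (size_t i = c + 1; i < n; ++i)
        {
            ldata[i * n + c] = adata[i * n + c] / ldata[c * n + c];   // illustrative update only
        }
    }
}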